Hacker News new | past | comments | ask | show | jobs | submit login

This Nim program might clarify some things for someone.

    import std/[random, math, sets, times]; type F = float
    
    template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
      ## Return (unbiased estim. cardinality of `it`, 1-alpha CI ebound*(1 +- e)).
      var x1 = initHashSet[type(xfm)](k); var x  = x1.addr
      var x2 = initHashSet[type(xfm)](k); var y  = x2.addr
      var p  = 1.0          # p always 2^-i means could ditch FP
      var n  = 0            #..RNG & arith for bit shifts/masks.
      let t0 = epochTime()
      for e in it:
        inc n   #; let e = xfm # to compute randState.next here
        x[].excl e          # 'd be nice to reduce to 1 hash by
        if rand(1.0) < p:   #..excl w/prob 1 - p but then cannot
          x[].incl e        #..add new keys.  p -> 0 anyway.
        while x[].len == k: # KnuthLoop not CVMIf guards against
          for e in x[]:     #..unlikely case of freeing nothing.
            if rand(1.0) < 0.5: y[].incl e
          p *= 0.5
          swap x, y; y[].clear # ↓ e.bound conservative by ~15X
      (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
       (epochTime() - t0)*1e9/n.F)
    
    when isMainModule:  # Takes nItem/trial dupMask space nTrial
      when defined danger: randomize()
      import std/[strutils, os, strformat]
      let n   =  if paramCount()>0: parseInt(paramStr(1)) else: 100000
      let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
      let k   =  if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
      let m   =  if paramCount()>3: parseInt(paramStr(4)) else: 35
      var ex  = initHashSet[uint64]()
      var xs  = newSeq[uint64](n)
      for j in 1..m:
        xs.setLen 0; ex.clear
        for i in 1..n: xs.add randState.next and msk
        for e in xs: ex.incl e
        let (apx, err, t) = uniqCEcvm(xs, uint64, k)
        let ap = apx.F; let an = ex.len.F
        echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
First few lines of laptop output:

    51160 49152 eb: 0.6235 |x-a|/x: 0.0392 10.5ns
    51300 51840 eb: 0.6235 |x-a|/x: 0.0105 10.7ns
    51250 55168 eb: 0.6235 |x-a|/x: 0.0764 10.9ns
    51388 52352 eb: 0.6235 |x-a|/x: 0.0188 10.5ns
Ditching the floating point RNG & arithmetic might be able to lower that time per element cost to more like 5 ns or maybe 2GiB/s for 8B ints. The toggle back & forth between old & new HashSet is just to re-use the same L1 CPU cache memory.

Note also that the error bound in the CVM paper seems very conservative. My guess is that it could maybe be tightened by 15+X (at the scale of those example numbers), but that might also require some kind of special function evaluation. Anyone have any ideas on that?

Also note that since this is just estimating unique identities, you can always just count unique 64-bit hashes of keys (or maybe even 32-bit hashes depending on your accuracy needs and cardinalities) if you care about the key space/key compares are slow.




Frankly, who can read this!? I am not sure what's worse, the multi-line comments spanning multiple lines of code, having multiple instructions on a single line, or the apparent disconnect between the pseudo-code of the article.


I would blame the majority of your criticism on the fact that HN is not the best place to read code. Also, syntax highlighting & basic familiarity with Nim helps.

His code is doing a few more things than necessary. The actual algorithm is inside the `uniqCEcvm` template. The `it` it receives is anything you can iterate over (a collection or an iterator). Multiple things in one line really only appear where they directly relate to the part at the beginning of the line.

The `when isMainModule` is Nim's way of Python's `if __name__ == "__main__"`. The entire part below that is really just a mini CL interface to bench different (random) examples. Final thing to note maybe, the last expression of a block (e.g. of the template here) will be returned.

And well, the style of comments is just personal preference of course. Whether you prefer to stay strictly below 80 cols or not, shrug.

I grant you that the usage of 2 sets + pointer access to them + swapping makes it harder to follow than necessary. But I assume the point of it was not on "how to write the simplest looking implementation of the algorithm as it appears in the paper". But rather to showcase a full implementation of a reasonably optimized version.

Here's a version (only the algorithm) following the paper directly:

    proc estimate[T](A: seq[T], ε, δ: float): float =
      let m = A.len
      var p = 1.0
      var thr = ceil(12.0 / ε^2 * log2(8*m.float / δ))
      var χ = initHashSet[T](thr.round.int)
      for i in 0 ..< m:
        χ.excl A[i]
        if rand(1.0) < p:
          χ.incl A[i]
        if χ.card.float >= thr:
          for el in toSeq(χ): # clean out ~half probabilistically
            if rand(1.0) < 0.5:
              χ.excl el
          p /= 2.0
          if χ.card.float >= thr:
            return -1.0
      result = χ.card.float / p


Thank you very much for having taken the time.. Your comments and function are both very helpful!




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: