This Nim program might clarify some things for someone.
import std/[random, math, sets, times]; type F = float

template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
  ## Return (unbiased estim. cardinality of `it`, 1-alpha CI err bound e
  ## (true count within estim*(1 +- e)), ns/element).
  var x1 = initHashSet[type(xfm)](k); var x = x1.addr
  var x2 = initHashSet[type(xfm)](k); var y = x2.addr
  var p = 1.0                 # p always 2^-i means could ditch FP
  var n = 0                   #..RNG & arith for bit shifts/masks.
  let t0 = epochTime()
  for e in it:
    inc n                     #; let e = xfm # to compute randState.next here
    x[].excl e                # 'd be nice to reduce to 1 hash by
    if rand(1.0) < p:         #..excl w/prob 1 - p but then cannot
      x[].incl e              #..add new keys. p -> 0 anyway.
    while x[].len == k:       # KnuthLoop not CVMIf guards against
      for e in x[]:           #..unlikely case of freeing nothing.
        if rand(1.0) < 0.5: y[].incl e
      p *= 0.5
      swap x, y; y[].clear    # ↓ e.bound conservative by ~15X
  (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
   (epochTime() - t0)*1e9/n.F)
when isMainModule: # Takes nItem/trial dupMask space nTrial
  when defined(danger): randomize()
  import std/[strutils, os, strformat]
  let n   = if paramCount()>0: parseInt(paramStr(1)) else: 100000
  let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
  let k   = if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
  let m   = if paramCount()>3: parseInt(paramStr(4)) else: 35
  var ex  = initHashSet[uint64]()
  var xs  = newSeq[uint64](n)
  for j in 1..m:
    xs.setLen 0; ex.clear
    for i in 1..n: xs.add randState.next and msk
    for e in xs: ex.incl e
    let (apx, err, t) = uniqCEcvm(xs, uint64, k)
    let ap = apx.F; let an = ex.len.F
    echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
Ditching the floating-point RNG & arithmetic might lower that per-element cost to more like 5 ns, or maybe 2 GiB/s for 8-byte ints. The toggling back & forth between the old & new HashSet is just to re-use the same L1-cache-resident memory.
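For example, one way that could look (my sketch, not code from the post; `rng`, `keepNew`, `keepHalf` are made-up names): since p is always 2^-i, the `rand(1.0) < p` test reduces to "are the low i bits of a raw PRNG word all zero?", and the 50/50 halving test is a single bit:

import std/random
var rng = initRand(123)                          # any uint64-producing PRNG
var i = 0                                        # p == 2.0^(-i); "p *= 0.5" becomes "inc i"
let r = rng.next                                 # one raw 64-bit word
let keepNew  = (r and ((1'u64 shl i) - 1)) == 0  # true w/prob 2^-i, i.e. rand(1.0) < p
let keepHalf = (r shr 63) == 0                   # true w/prob 1/2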
Note also that the error bound in the CVM paper seems very conservative. My guess is that it could maybe be tightened by 15+X (at the scale of those example numbers), but that might also require some kind of special function evaluation. Anyone have any ideas on that?
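To put a number on that (using the demo defaults above, k=512, n=100000, alpha=0.05, which is an assumption about what scale the "15+X" refers to):

import std/math   # back-of-envelope check of the ~15X figure
let eb = sqrt(12.0 * ln(8.0 * 100_000.0 / 0.05) / 512.0)
echo eb   # ~0.62; if observed |x-a|/x is ~0.04, the bound is ~15X conservative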
Also note that, since this is just estimating unique identities, you can always count unique 64-bit hashes of keys instead (or maybe even 32-bit hashes, depending on your accuracy needs and cardinalities) if the key space is large or key compares are slow.
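E.g., a sketch of that re-using `uniqCEcvm` from above (names here are just illustration):

import std/hashes
var hs = newSeq[uint64]()
for key in ["foo", "bar", "foo"]:
  hs.add cast[uint64](key.hash)   # Hash is int-sized; collisions are the accuracy trade-off
let (apx, eb, ns) = uniqCEcvm(hs, uint64, 512)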
Frankly, who can read this!? I'm not sure what's worse: the comments spanning multiple lines of code, the multiple instructions on a single line, or the apparent disconnect from the pseudo-code in the article.
I would blame most of your criticism on the fact that HN is not the best place to read code. Also, syntax highlighting & basic familiarity with Nim help.
His code is doing a few more things than necessary. The actual algorithm is inside the `uniqCEcvm` template. The `it` it receives is anything you can iterate over (a collection or an iterator). Multiple things in one line really only appear where they directly relate to the part at the beginning of the line.
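E.g. (a sketch; `firstN` is a made-up iterator), this should work just as well as passing a seq, since the template splices `it` directly into its `for e in it:` loop:

iterator firstN(n: int): uint64 =
  for i in 1 .. n: yield uint64(i mod 1000)   # only 1000 distinct values
let (apx, eb, ns) = uniqCEcvm(firstN(100_000), uint64, 512)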
The `when isMainModule` is Nim's equivalent of Python's `if __name__ == "__main__"`. The entire part below that is really just a mini command-line interface to bench different (random) examples.
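I.e. the usual pattern (a minimal example of mine, not from the post):

# mylib.nim
proc double*(x: int): int = 2 * x
when isMainModule:    # runs only when mylib.nim is compiled as the main program,
  echo double(21)     #..not when another module does `import mylib`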
Final thing to note, maybe: the last expression of a block (e.g. of the template here) is its return value.
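E.g.:

template sq(x: untyped): untyped =
  let v = x    # statements first...
  v * v        #..the block's last expression is the template's value
echo sq(7)     # 49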
And well, the style of comments is just personal preference of course. Whether you prefer to stay strictly below 80 cols or not, shrug.
I grant you that the use of 2 sets + pointer access to them + swapping makes it harder to follow than necessary. But I assume the point was not to write the simplest-looking implementation of the algorithm as it appears in the paper, but rather to showcase a full, reasonably optimized one.
Here's a version (only the algorithm) following the paper directly:
import std/[random, math, sets, sequtils]

proc estimate[T](A: seq[T], ε, δ: float): float =
  let m = A.len
  var p = 1.0
  let thr = ceil(12.0 / ε^2 * log2(8.0*m.float / δ))
  var χ = initHashSet[T](thr.round.int)
  for i in 0 ..< m:
    χ.excl A[i]
    if rand(1.0) < p:
      χ.incl A[i]
    if χ.card.float >= thr:
      for el in toSeq(χ):   # clean out ~half probabilistically
        if rand(1.0) < 0.5:
          χ.excl el
      p /= 2.0
      if χ.card.float >= thr:
        return -1.0
  result = χ.card.float / p
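And a quick (assumed) way to exercise it, mirroring the benchmark above:

when isMainModule:
  randomize()
  var xs = newSeq[int]()
  for i in 1 .. 100_000: xs.add rand(0 .. 0xFFFF)   # up to ~64K distinct values
  echo estimate(xs, 0.5, 0.05)   # -1.0 signals the unlikely failure branch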