I learned how expensive hashmaps and hashsets are through Wikipedia dumps. I did some analysis of the most linked-to pages; countries were among the highest. Hash sets for holding outgoing edges in the link graph ended up causing my program to exceed my laptop’s memory, while plain old lists (Python) were fine. And given there aren’t a crazy number of links per page, using lists is fine performance-wise.
This is a fairly large data set indeed. The memory overhead (probably something like 4-8x for hash maps?) can become quite noticeable at those sizes.
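The rough numbers are easy to check. A minimal sketch, assuming a synthetic graph of 100,000 pages with ~30 random outgoing links each (the page count and fan-out are invented for illustration, not taken from the actual dump), comparing lists vs. sets as the per-page edge container:

    import random
    import tracemalloc

    def measure(container, n_pages=100_000, fanout=30):
        # Build a toy adjacency structure: one container of link-target IDs per page,
        # and report peak memory allocated while building it.
        tracemalloc.start()
        adj = [container(random.sample(range(n_pages), fanout)) for _ in range(n_pages)]
        _current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return peak

    list_bytes = measure(list)
    set_bytes = measure(set)
    print(f"lists: {list_bytes / 2**20:.0f} MiB, sets: {set_bytes / 2**20:.0f} MiB, "
          f"ratio: {set_bytes / list_bytes:.1f}x")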
Since Wikipedia pages already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like a Roaring bitmap for compressed storage of relations.
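For example, with the pyroaring package (one of several Python bindings for Roaring bitmaps; the library choice is just an assumption here), each page's outgoing links could be a compressed bitmap keyed by page ID:

    from pyroaring import BitMap  # pip install pyroaring

    # Hypothetical adjacency: page ID -> Roaring bitmap of outgoing link page IDs.
    # The IDs below are made up for illustration.
    outgoing = {}
    outgoing[12345] = BitMap([42, 1007, 250310])
    outgoing[12345].add(998877)

    print(1007 in outgoing[12345])           # fast membership test
    print(len(outgoing[12345]))              # number of outgoing links
    print(len(outgoing[12345].serialize()))  # compressed size in bytes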
Sort them and use a vector of vectors for the adjacency list... or better still, use a graph processing library or graph database to manage that for you...
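A minimal sketch of that layout in Python: a flat CSR-style array of sorted neighbors plus an offsets array, with membership tests via binary search. The toy edge list is invented for illustration.

    from array import array
    from bisect import bisect_left

    # Toy edge list: (source page ID, target page ID).
    edges = [(0, 3), (0, 1), (1, 2), (2, 0), (2, 3), (2, 1)]
    n = 4

    # CSR-style adjacency: neighbors of page i live in flat[offsets[i]:offsets[i+1]], sorted.
    edges.sort()
    flat = array("l", [t for _, t in edges])
    offsets = array("l", [0] * (n + 1))
    for s, _ in edges:
        offsets[s + 1] += 1
    for i in range(n):
        offsets[i + 1] += offsets[i]

    def has_edge(s, t):
        lo, hi = offsets[s], offsets[s + 1]
        i = bisect_left(flat, t, lo, hi)
        return i < hi and flat[i] == t

    print(list(flat[offsets[2]:offsets[3]]))  # neighbors of page 2 -> [0, 1, 3]
    print(has_edge(0, 3), has_edge(1, 3))     # True False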
That is a compressed dump you are looking at. The uncompressed data is much larger. Link graphs in general can grow quite big. Also, not every laptop has 32 GB RAM.
I'm still sticking with 16GB on my laptop, so that would exceed my current RAM. It may also be cutting it close even on a 32GB machine, since the OS and other programs won't leave all of your physical RAM available.
Lists in Python have an integer for their size and a pointer for each element. Sets presumably have some number of buckets to hold pointers, but many more buckets get allocated than are actually used in small sets.
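You can see that over-allocation directly with sys.getsizeof (which counts only the container itself, not the objects it points to); the exact byte counts vary by CPython version and platform:

    import sys

    # Compare container overhead for a list vs. a set holding the same n elements.
    for n in (0, 5, 30, 100):
        items = list(range(n))
        print(f"n={n:3d}  list: {sys.getsizeof(items):5d} bytes   "
              f"set: {sys.getsizeof(set(items)):5d} bytes")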