
I learned how expensive hashmaps and hashsets are through Wikipedia dumps. I did some analysis of the most linked-to pages; countries were among the highest. Using hash sets to hold the outgoing edges in the link graph caused my program to exceed my laptop's memory. Plain old Python lists were fine, though, and since no page has a crazy number of links, lists are fine performance-wise too.


This is a fairly large data set indeed. The memory overhead (which is probably something like 4-8x for hash maps?) can start to become fairly noticeable at those sizes.

Since Wikipedia pages already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like Roaring Bitmaps for compressed storage of the relations.
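A minimal sketch of that idea, assuming the third-party pyroaring package (Python bindings for Roaring Bitmaps); the titles and IDs below are made up for illustration:

    from pyroaring import BitMap  # pip install pyroaring

    # Hypothetical mapping from page title to its canonical numeric ID.
    title_to_id = {"Physics": 101, "France": 202, "Germany": 303}

    # One compressed bitmap of outgoing link IDs per page.
    outgoing = {title_to_id["Physics"]: BitMap([202, 303])}

    # Membership tests and set operations work on the compressed form.
    print(202 in outgoing[title_to_id["Physics"]])                # True
    print(len(outgoing[title_to_id["Physics"]] & BitMap([202])))  # 1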


Sort them, and use a vector of vectors for the adjacency list... Or better still use a graph processing library or graph database to manage that for you...
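For what it's worth, a rough sketch of the sorted list-of-lists approach in plain Python (the page IDs and edges are invented; a graph library would manage this for you):

    import bisect

    # Hypothetical edges: (source_page_id, target_page_id) pairs.
    edges = [(0, 5), (0, 2), (1, 3), (0, 9), (1, 2)]

    num_pages = 10
    adjacency = [[] for _ in range(num_pages)]  # one neighbor list per page
    for src, dst in edges:
        adjacency[src].append(dst)

    # Sorting each neighbor list gives O(log n) membership via binary search.
    for neighbors in adjacency:
        neighbors.sort()

    def has_link(src, dst):
        neighbors = adjacency[src]
        i = bisect.bisect_left(neighbors, dst)
        return i < len(neighbors) and neighbors[i] == dst

    print(has_link(0, 5))  # True
    print(has_link(1, 9))  # False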


How'd the hashset exceed your laptop memory, if the whole dump is just 22GB? You should be able to fit the entire dataset in RAM.


That is a compressed dump you are looking at. The uncompressed data is much larger. Link graphs in general can grow quite big. Also, not every laptop has 32 GB RAM.


I'm still sticking with 16GB on my laptop, so that would exceed my current RAM. It may also cut it close on a 32GB machine, since the OS and other programs don't leave all of your physical RAM available.


I think my laptop had 8GB at the time


Why did lists require less memory? Was it because you only held a subset of keys in the lists?


Lists in Python store an integer for their size and a pointer for each element. Sets are hash tables underneath: each bucket holds a cached hash plus a pointer, and many more buckets are allocated than get used so that the table stays sparse and collisions stay rare.
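You can see the gap directly with sys.getsizeof, which reports the container's own footprint (not the elements it points to); exact numbers vary by CPython version:

    import sys

    items = list(range(1000))

    print(sys.getsizeof(items))       # ~8 KB: one pointer per element
    print(sys.getsizeof(set(items)))  # ~32 KB: sparse table of (hash, pointer) slots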



