I learned how expensive hashmaps and hashsets are through Wikipedia dumps. I did some analysis of the most linked-to pages; countries were among the highest. Hash sets for holding outgoing edges in the link graph ended up causing my program to exceed my laptop’s memory, while plain old lists (Python) were fine. And given there aren’t a crazy number of links per page, using lists is fine performance-wise.
This is a fairly large data set indeed. The memory overhead (probably something like 4-8x for hash maps?) can become quite noticeable at those sizes.
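The rough numbers are easy to check. A minimal sketch, assuming a synthetic graph of 100,000 pages with ~30 random outgoing links each (the page count and fan-out are invented for illustration, not taken from the actual dump), comparing lists vs. sets as the per-page edge container:

    import random
    import tracemalloc

    def measure(container, n_pages=100_000, fanout=30):
        # Build a toy adjacency structure: one container of link-target IDs per page,
        # and report peak memory allocated while building it.
        tracemalloc.start()
        adj = [container(random.sample(range(n_pages), fanout)) for _ in range(n_pages)]
        _current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return peak

    list_bytes = measure(list)
    set_bytes = measure(set)
    print(f"lists: {list_bytes / 2**20:.0f} MiB, sets: {set_bytes / 2**20:.0f} MiB, "
          f"ratio: {set_bytes / list_bytes:.1f}x")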
Since Wikipedia pages already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like a Roaring bitmap for compressed storage of relations.
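For example, with the pyroaring package (one of several Python bindings for Roaring bitmaps; the library choice is just an assumption here), each page's outgoing links could be a compressed bitmap keyed by page ID:

    from pyroaring import BitMap  # pip install pyroaring

    # Hypothetical adjacency: page ID -> Roaring bitmap of outgoing link page IDs.
    # The IDs below are made up for illustration.
    outgoing = {}
    outgoing[12345] = BitMap([42, 1007, 250310])
    outgoing[12345].add(998877)

    print(1007 in outgoing[12345])           # fast membership test
    print(len(outgoing[12345]))              # number of outgoing links
    print(len(outgoing[12345].serialize()))  # compressed size in bytes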
Sort them and use a vector of vectors for the adjacency list... or better still, use a graph processing library or graph database to manage that for you...
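A minimal sketch of that layout in Python: a flat CSR-style array of sorted neighbors plus an offsets array, with membership tests via binary search. The toy edge list is invented for illustration.

    from array import array
    from bisect import bisect_left

    # Toy edge list: (source page ID, target page ID).
    edges = [(0, 3), (0, 1), (1, 2), (2, 0), (2, 3), (2, 1)]
    n = 4

    # CSR-style adjacency: neighbors of page i live in flat[offsets[i]:offsets[i+1]], sorted.
    edges.sort()
    flat = array("l", [t for _, t in edges])
    offsets = array("l", [0] * (n + 1))
    for s, _ in edges:
        offsets[s + 1] += 1
    for i in range(n):
        offsets[i + 1] += offsets[i]

    def has_edge(s, t):
        lo, hi = offsets[s], offsets[s + 1]
        i = bisect_left(flat, t, lo, hi)
        return i < hi and flat[i] == t

    print(list(flat[offsets[2]:offsets[3]]))  # neighbors of page 2 -> [0, 1, 3]
    print(has_edge(0, 3), has_edge(1, 3))     # True False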
That is a compressed dump you are looking at. The uncompressed data is much larger. Link graphs in general can grow quite big. Also, not every laptop has 32 GB RAM.
I'm still sticking with 16GB on my laptop, so that would exceed my current RAM. It may also be cutting it close even on a 32GB machine, since the OS and other programs won't leave all of your physical RAM available.
Lists in Python have an integer for their size and a pointer for each element. Sets presumably have some number of buckets to hold pointers, but many more buckets get allocated than are actually used in small sets.
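You can see that over-allocation directly with sys.getsizeof (which counts only the container itself, not the objects it points to); the exact byte counts vary by CPython version and platform:

    import sys

    # Compare container overhead for a list vs. a set holding the same n elements.
    for n in (0, 5, 30, 100):
        items = list(range(n))
        print(f"n={n:3d}  list: {sys.getsizeof(items):5d} bytes   "
              f"set: {sys.getsizeof(set(items)):5d} bytes")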