This is a fairly large data set indeed. The memory overhead (which is probably something like 4-8x for hash maps?) can start to become fairly noticeable at those sizes.
Since Wikipedia posts already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like roaringbitmap for compressed storage of relations.
Since Wikipedia posts already have a canonical numeric ID, if map semantics are important, I'd probably load that mapping into memory and use something like roaringbitmap for compressed storage of relations.