
Cross-posting my comment from Reddit[1] because I think it's interesting.

-----

Nice post. I love calling attention to this. Just a few months ago, I ran into the ~same~ similar problem, although it wasn't caused by `collect()`. It was caused by "normal" `Vec` usage:

https://github.com/BurntSushi/aho-corasick/commit/474393be8d...

The issue appeared when building large Aho-Corasick automatons. Otherwise, it usually doesn't matter too much because the absolute memory sizes are pretty small. But if your automaton uses 1GB of memory, then because of the excess capacity in a bunch of little `Vec` values, that usage could easily balloon to 2GB. And even at that size, _it was hard to notice that it was more than it should have been_! It is easy to just think that's normal memory usage. Especially when it's coming from the automaton representation that you know is less memory efficient. (What I'm talking about here is something I called a "non-contiguous NFA." The aho-corasick crate also has a "contiguous NFA" and a DFA. The "contiguous NFA" uses one contiguous `Vec` allocation with a bunch of bit-packing tricks.)

But then someone actually reported an issue on the `ahocorasick_rs`[2] Python project (bindings to the `aho-corasick` crate) that the `pyahocorasick`[3] Python package (with a bespoke C implementation of Aho-Corasick) was using _substantially_ less memory. The contiguous NFA was still doing better, but the _peak_ memory usage of `ahocorasick_rs` was much much bigger than `pyahocorasick`. That prompted me to investigate and figure out what the heck was going wrong. (And this is one reason why benchmarks comparing tools or libraries are actually useful beyond just advertisements or flexing. They alert you to what is possible, and thus possibly push you to find bugs in your current implementation that might be otherwise easy to miss. In other words, benchmark comparisons can turn unknown unknowns into known unknowns.)

So since this prompted me to look very carefully at memory usage, I noticed that the memory usage reported by `AhoCorasick::memory_usage`[4] was substantially smaller than peak RSS memory usage in a simple reproduction program of the original issue reported to `ahocorasick_rs`. I eventually figured out that was because of the excess capacity used by a zillion tiny `Vec`s in the non-contiguous representation. I fixed _most_ of that by calling `shrink_to_fit()`. But as the commit linked above notes, it wasn't really feasible to call `shrink_to_fit()` on all `Vec`s because that ended up with a non-trivial impact on construction time. And on top of all of that, memory usage was still worse than `pyahocorasick`.
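
To make the effect concrete, here's a tiny standalone program (not code from the crate, and the exact numbers depend on `Vec`'s growth strategy, which isn't guaranteed) showing how a pile of little `Vec`s grown by pushes accumulates unused capacity, and what `shrink_to_fit()` reclaims:

```rust
fn main() {
    // Simulate lots of small per-state transition lists, each grown by pushes.
    let mut states: Vec<Vec<u32>> = Vec::new();
    for i in 0..100_000u32 {
        let mut transitions = Vec::new();
        // 9 pushes: with the usual doubling growth this typically ends up
        // with capacity 16, so nearly half of each little allocation is unused.
        for j in 0..9u32 {
            transitions.push(i + j);
        }
        states.push(transitions);
    }

    let used: usize = states.iter().map(|v| v.len()).sum();
    let allocated: usize = states.iter().map(|v| v.capacity()).sum();
    println!("elements used: {used}, elements allocated: {allocated}");

    // Reclaiming the slack costs extra work (and possibly a reallocation) per
    // Vec, which is why calling this on every Vec slowed construction down.
    for v in &mut states {
        v.shrink_to_fit();
    }
    let allocated: usize = states.iter().map(|v| v.capacity()).sum();
    println!("after shrink_to_fit: elements allocated: {allocated}");
}
```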

But why was I using a bunch of little `Vec`s in the first place? It's because this is the "non-contiguous" NFA. This is the thing that gets built from a list of patterns by first building a trie then filling in the failure transitions. That trie construction really demands random access mutation, which puts a constraint on your data structure choices. Plus, I had designed the contiguous NFA as a release valve of sorts that lifts this constraint. So I reasoned, "sure, the non-contiguous NFA isn't designed to be fast. It just needs to get us started." But still, `pyahocorasick` only has one Aho-Corasick implementation (the `aho-corasick` crate has 3). So it must be doing something differently.

It turns out that `pyahocorasick` uses linked lists! THE HORROR! But actually, they are a really good solution to this problem. As a Rust programmer, I reach for `Vec` first. But a C programmer reaches for linked lists first. And it turns out that linked lists are a much better fit here.

And so, I switched to linked lists[5]. (But using indices instead of pointers. Because of course. This is Rust. :-)) And now `ahocorasick_rs` uses less memory than `pyahocorasick`!
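
If you haven't seen the trick before, here's a minimal sketch of what "linked lists, but with indices" means. The names and layout are mine for illustration, not the crate's actual representation:

```rust
// The idea: a singly linked list whose "pointers" are indices into one
// shared Vec, so there is one growing allocation for the whole automaton
// instead of thousands of tiny Vecs, each with its own slack capacity.

const NONE: u32 = u32::MAX;

struct Transition {
    byte: u8,
    next_state: u32,
    /// Index of the next transition belonging to the same state, or NONE.
    next_link: u32,
}

struct NfaBuilder {
    /// All transitions for all states live in this one arena.
    transitions: Vec<Transition>,
    /// heads[state] is the index of the first transition of that state.
    heads: Vec<u32>,
}

impl NfaBuilder {
    fn add_state(&mut self) -> usize {
        self.heads.push(NONE);
        self.heads.len() - 1
    }

    fn add_transition(&mut self, state: usize, byte: u8, next_state: u32) {
        let idx = self.transitions.len() as u32;
        self.transitions.push(Transition {
            byte,
            next_state,
            next_link: self.heads[state],
        });
        self.heads[state] = idx;
    }

    /// Random-access lookup during trie construction: scan this state's list.
    fn find(&self, state: usize, byte: u8) -> Option<u32> {
        let mut link = self.heads[state];
        while link != NONE {
            let t = &self.transitions[link as usize];
            if t.byte == byte {
                return Some(t.next_state);
            }
            link = t.next_link;
        }
        None
    }
}

fn main() {
    let mut nfa = NfaBuilder { transitions: Vec::new(), heads: Vec::new() };
    let root = nfa.add_state();
    let s1 = nfa.add_state() as u32;
    nfa.add_transition(root, b'a', s1);
    assert_eq!(nfa.find(root, b'a'), Some(s1));
}
```

The nice property is that every node lives in one arena `Vec`, so growth amortization (and its slack capacity) is paid once for the whole automaton instead of once per state.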

[1]: https://old.reddit.com/r/rust/comments/199jycb/identifying_r...

[2]: https://pypi.org/project/ahocorasick-rs/

[3]: https://pypi.org/project/pyahocorasick/

[4]: https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoC...

[5]: https://github.com/BurntSushi/aho-corasick/commit/31bb1fc30d...



> Just a few months ago, I ran into the same problem, although it wasn't caused by `collect()`. It was caused by "normal" `Vec` usage

If it wasn't caused by `collect()`, then I suppose it's a related problem, but not the same problem. Your problem was caused by, as your commit message says, "ignorance [original: ignorant] of the broader effects of this strategy [double-when-full] in aggregate".

The OP's problem, on the other hand, is more about reusing memory that has already been allocated when you "convert" (in Rust terms: `into_iter()`) a Vec into another Vec. What's more, the reuse happens even when the new Vec has a different element type from the original Vec's. This behavior is more surprising than the double-when-full strategy, which is more widely known by programmers (even if they sometimes "forget" about its implications).
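
For concreteness, here's a small standalone example of that surprise (the allocation reuse comes from an internal standard-library specialization rather than a documented guarantee, so the exact capacities can vary between compiler versions):

```rust
fn main() {
    // Build a big Vec: one million u32s, so roughly 4 MB of capacity.
    let big: Vec<u32> = (0..1_000_000).collect();
    println!("big:   len = {}, cap = {}", big.len(), big.capacity());

    // "Convert" it into a Vec of a different (same-sized) element type.
    // The standard library may reuse the original allocation here; when it
    // does, only four elements survive the filter but the ~4 MB of capacity
    // comes along for the ride.
    let mut small: Vec<i32> = big
        .into_iter()
        .filter(|&x| x < 4)
        .map(|x| x as i32)
        .collect();
    println!("small: len = {}, cap = {}", small.len(), small.capacity());

    // The fix, if it matters: give the excess back.
    small.shrink_to_fit();
    println!("after shrink_to_fit: cap = {}", small.capacity());
}
```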


I feel like the second sentence you quoted clarified that? "same problem" in this context means, "there is excess and unused capacity in the Vec." Your description is more precise, and at that level of precision, sure, they are not the "same problem." But they still share the same characteristic that is easy to forget: a `Vec` might have unused capacity, leading to higher memory usage than you'd otherwise expect if you aren't accounting for it.

My phrasing reflects what I think is the more universal lesson to draw from the blog, one that isn't provoked merely by how `collect()` works, but rather by the nature of growth amortization itself.

> This behavior is more surprising than the double-when-full strategy

I agree.

I updated my comment to use "similar" instead of "same."


What we need is a page-based Vec that mmaps (anonymously) for the storage but leaves the unused portions as untouched zero pages, and therefore not part of RSS, until actually required. (And when clearing/shrinking sections, madvise DONTNEED the pages.)

That is, the vec could expand to areas much larger than the actual used size, but this would have no effect on process RSS until those pages get dirtied with actual data.
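
A rough sketch of the storage layer I mean, assuming the `libc` crate and a Linux-y target (illustration only, with no safety wrapper and a hard-coded page size):

```rust
use std::ptr;

const RESERVE_BYTES: usize = 1 << 30; // reserve 1 GiB of address space up front
const PAGE: usize = 4096;

/// Reserve a large anonymous mapping. Untouched pages stay as shared zero
/// pages and don't count toward RSS until they're first written.
fn reserve() -> *mut u8 {
    let p = unsafe {
        libc::mmap(
            ptr::null_mut(),
            RESERVE_BYTES,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(p, libc::MAP_FAILED, "mmap failed");
    p as *mut u8
}

/// On shrink, tell the kernel the tail of the reservation is no longer
/// needed so those pages leave RSS.
fn release_tail(base: *mut u8, used_bytes: usize) {
    let keep = (used_bytes + PAGE - 1) & !(PAGE - 1); // round up to a page
    if keep < RESERVE_BYTES {
        let rc = unsafe {
            libc::madvise(
                base.add(keep) as *mut libc::c_void,
                RESERVE_BYTES - keep,
                libc::MADV_DONTNEED,
            )
        };
        assert_eq!(rc, 0, "madvise failed");
    }
}

fn main() {
    let base = reserve();
    // Pretend the "vec" grew to 10 MiB of real data, then shrank to 1 MiB.
    unsafe { ptr::write_bytes(base, 1, 10 * 1024 * 1024) };
    release_tail(base, 1024 * 1024);
}
```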


That can be a bad idea for subtle reasons, even though it does seem like a better design at first.

When you do the madvise, based on the contract of the API, the kernel has to do a TLB shootdown, which is really expensive. And not just “expensive for my process” but expensive in terms of a significant slowdown for the entire machine. Honestly, you could probably DDoS a machine if you could reliably trigger that shootdown.

As such you want to be very very careful where you place that data structure.

For what it’s worth, your memory allocator (or at least a good one like mimalloc or the newer, non-gperftools tcmalloc) will do exactly what you describe: free memory back to the OS using heuristics tested out at scale.

As for why a shootdown is needed: it’s so that a racing allocation in another process doesn’t get handed that memory while some core still has a stale TLB entry pointing at it (the core that did the allocation may not even be the one where it’s used).


Yes, well, I think you're right that it's a lot about the contract of the API and the expectations of the user.

I'd certainly want intelligence (either by the user or by the framework) in how frequently you release pages back to the OS.

But the DONTNEED is really not the core of the value. It's being able to create vectors that don't fragment as badly.

And yes, you're right a decent allocator could help with this, and what I'm describing is really just a kind of specialized allocator.


I don’t follow. Fragmentation wasn’t the problem here; it’s that an optimization trying to reuse the underlying allocation resulted in severe over-allocation on a specific path. That would happen even with your idea unless you use DONTNEED to release all the unused pages.

I think fragmentation gets more attention than it deserves; in practice it’s rarely the problem. It’s also a problem that can be solved by an application reset, and periodically restarting your running application in a way that doesn’t cause severe disruption is a better practice that works around many more issues than just fragmentation.


Is MADV_FREE on Linux any better in this regard? It's what the libc allocators tend to use, if I recall correctly.


It would have the same problem, because AFAIK there’s no way to free memory back to the OS without a shootdown. As I describe elsewhere, without one you can run into race conditions where another process allocating memory gets access to memory being freed by your process through a stale TLB entry.

Here’s a discussion [1] of a hypothetical lazy_munmap option that would initiate the unmap without an immediate shootdown (i.e. the CPUs would lazily evict from the TLB when they notice the request), but no such option exists. It’s also not immediately clear that such an API could exist on current CPU architectures or whether it would require new HW support; I don’t have enough hands-on expertise at that level. Anyway, [1] is an interesting discussion of the idea, and the note about huge pages making such tricks even less valuable is worth keeping in mind.

[1] https://news.ycombinator.com/item?id=23216590


Couldn't this obfuscate OOM-type problems by not triggering actual memory issues until much later than the allocation?


Maybe? Depends on how you monitor and what your expectations are. But for workloads that use a lot of memory and want to avoid wasteful empty allocations, or that have sparse data, etc., it's a reasonable strategy.


Is this not going to be more or less what the memory allocator backing Vec does under the hood anyway?


That's already what it does (except when shrinking).


Just a somewhat tangential thought: do you know of any tricks to speed up the construction phase of Aho-Corasick? A recent problem of mine needed Aho-Corasick, but with several million strings in the dictionary it took a bit longer than I expected to construct. I don't particularly care about memory.


Hmmm. The phrasing of your question is a little ambiguous. I could interpret it two ways:

1. "Do you know if there are any config knobs in the aho-corasick crate that will speed up construction?"

2. "I have my own implementation of Aho-Corasick and construction isn't as fast as I hoped. What kinds of tricks can I employ to make my own implementation faster?"

I'd be happy to try and answer either of these, but I think it would be better to ask with a concrete example in the repo's Discussions: https://github.com/BurntSushi/aho-corasick/discussions

All of the tricks I know about (and that I judged to be worth doing) are in the aho-corasick crate. There's really just no getting around the fact that you need to first build a trie (which requires visiting every byte in every pattern) and then do a breadth-first search across the entire trie to set up failure transitions. Making construction substantially faster probably requires some kind of innovation in those two parts at a fundamental level.
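
For reference, the shape of that second phase looks roughly like the sketch below. It uses a simplified dense-transition trie for illustration and isn't the crate's code, but it shows why the work is hard to avoid: every state gets popped once and all of its byte values get examined.

```rust
use std::collections::VecDeque;

const ALPHABET: usize = 256;

/// Simplified trie state: index 0 is the root, and 0 in `trans` also means
/// "no transition" (no state transitions back to the root in the trie itself).
struct State {
    trans: [u32; ALPHABET],
    fail: u32,
}

/// Phase two: breadth-first traversal that fills in failure transitions.
fn fill_failure_transitions(states: &mut [State]) {
    let mut queue: VecDeque<u32> = VecDeque::new();
    // States at depth 1 fail back to the root.
    for b in 0..ALPHABET {
        let next = states[0].trans[b];
        if next != 0 {
            states[next as usize].fail = 0;
            queue.push_back(next);
        }
    }
    while let Some(s) = queue.pop_front() {
        for b in 0..ALPHABET {
            let next = states[s as usize].trans[b];
            if next == 0 {
                continue;
            }
            // Follow failure links from `s` until some state has a
            // transition on byte `b` (the root implicitly always does).
            let mut f = states[s as usize].fail;
            while f != 0 && states[f as usize].trans[b] == 0 {
                f = states[f as usize].fail;
            }
            let fail_dest = states[f as usize].trans[b];
            states[next as usize].fail = fail_dest;
            queue.push_back(next);
        }
    }
}
```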

You can also cheat at this by designing your Aho-Corasick automaton to be zero-copy deserializable. The DFAs in regex-automata support this. I waver on whether to do this for aho-corasick, because it acts like a straitjacket for evolving the internals of the library.


Thanks for the great write up. Your writing and deep dives never fail to enlighten and entertain at the same time.


What’s the reason for not always using the contiguous variant?


You can't build the contiguous variant directly from a sequence of patterns. You need some kind of intermediate data structure to incrementally build a trie in memory. The contiguous NFA needs to know the complete picture of each state in order to compress it into memory. It makes decisions like, "if the number of transitions of this state is less than N, then use this representation" or "use the most significant N bits of the state pointer to indicate its representation." It is difficult to do this in an online fashion, and likely impossible to do without some sort of compromise. For example, you don't know how many transitions each state has until you've completed construction of the trie. But how do you build the trie if the state representation needs to know the number of transitions? Classic chicken-and-egg.

Note that the conversion from a non-contiguous NFA to a contiguous NFA is, relatively speaking, pretty cheap. The only real reason to not use a contiguous NFA is that it can't represent as many patterns as a non-contiguous NFA. (Because of the compression tricks it uses.)
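
To give a flavor of what I mean (this is an illustration of mine, not the crate's actual layout), the per-state decision looks roughly like this, and it can only be made once the state's final transition count is known:

```rust
/// Illustrative compressed state representation: pick a compact encoding per
/// state based on how many transitions it ended up with.
enum CompressedState {
    /// Few transitions: store (byte, next) pairs inline and scan linearly.
    Sparse { transitions: Vec<(u8, u32)> },
    /// Many transitions: store a dense 256-entry lookup table.
    Dense { table: Box<[u32; 256]> },
}

fn compress(transitions: &[(u8, u32)]) -> CompressedState {
    // A threshold like this can only be applied after trie construction,
    // because while the trie is still being built the count keeps changing.
    const DENSE_THRESHOLD: usize = 64;
    if transitions.len() < DENSE_THRESHOLD {
        CompressedState::Sparse { transitions: transitions.to_vec() }
    } else {
        let mut table = Box::new([0u32; 256]);
        for &(byte, next) in transitions {
            table[byte as usize] = next;
        }
        CompressedState::Dense { table }
    }
}

fn main() {
    assert!(matches!(
        compress(&[(b'a', 1), (b'b', 2)]),
        CompressedState::Sparse { .. }
    ));
}
```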

The interesting bits start here: https://github.com/BurntSushi/aho-corasick/blob/f227162f7c56...



