
Cross-posting my comment from Reddit[1] because I think it's interesting.

-----

Nice post. I love calling attention to this. Just a few months ago, I ran into the ~same~ similar problem, although it wasn't caused by `collect()`. It was caused by "normal" `Vec` usage:

https://github.com/BurntSushi/aho-corasick/commit/474393be8d...

The issue appeared when building large Aho-Corasick automatons. Otherwise, it usually doesn't matter too much because the absolute memory sizes are pretty small. But if your automaton uses 1GB of memory, then because of the excess capacity in a bunch of little `Vec` values, that usage could easily balloon to 2GB. And even at that size, _it was hard to notice that it was more than it should have been_! It is easy to just think that's normal memory usage. Especially when it's coming from the automaton representation that you know is less memory efficient. (What I'm talking about here is something I called a "non-contiguous NFA." The aho-corasick crate also has a "contiguous NFA" and a DFA. The "contiguous NFA" uses one contiguous `Vec` allocation with a bunch of bit-packing tricks.)

But then someone actually reported an issue on the `ahocorasick_rs`[2] Python project (bindings to the `aho-corasick` crate) that the `pyahocorasick`[3] Python package (with a bespoke C implementation of Aho-Corasick) was using _substantially_ less memory. The contiguous NFA was still doing better, but the _peak_ memory usage of `ahocorasick_rs` was much much bigger than `pyahocorasick`. That prompted me to investigate and figure out what the heck was going wrong. (And this is one reason why benchmarks comparing tools or libraries are actually useful beyond just advertisements or flexing. They alert you to what is possible, and thus possibly push you to find bugs in your current implementation that might be otherwise easy to miss. In other words, benchmark comparisons can turn unknown unknowns into known unknowns.)

So since this prompted me to look very carefully at memory usage, I noticed that the memory usage reported by `AhoCorasick::memory_usage`[4] was substantially smaller than peak RSS memory usage in a simple reproduction program of the original issue reported to `ahocorasick_rs`. I eventually figured out that was because of the excess capacity used by a zillion tiny `Vec`s in the non-contiguous representation. I fixed _most_ of that by calling `shrink_to_fit()`. But as the commit linked above notes, it wasn't really feasible to call `shrink_to_fit()` on all `Vec`s because that ended up with a non-trivial impact on construction time. And on top of all of that, memory usage was still worse than `pyahocorasick`.
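
To make the effect concrete, here's a tiny standalone program (not code from the crate, and the exact numbers depend on `Vec`'s growth strategy, which isn't guaranteed) showing how a pile of little `Vec`s grown by pushes accumulates unused capacity, and what `shrink_to_fit()` reclaims:

```rust
fn main() {
    // Simulate lots of small per-state transition lists, each grown by pushes.
    let mut states: Vec<Vec<u32>> = Vec::new();
    for i in 0..100_000u32 {
        let mut transitions = Vec::new();
        // 9 pushes: with the usual doubling growth this typically ends up
        // with capacity 16, so nearly half of each little allocation is unused.
        for j in 0..9u32 {
            transitions.push(i + j);
        }
        states.push(transitions);
    }

    let used: usize = states.iter().map(|v| v.len()).sum();
    let allocated: usize = states.iter().map(|v| v.capacity()).sum();
    println!("elements used: {used}, elements allocated: {allocated}");

    // Reclaiming the slack costs extra work (and possibly a reallocation) per
    // Vec, which is why calling this on every Vec slowed construction down.
    for v in &mut states {
        v.shrink_to_fit();
    }
    let allocated: usize = states.iter().map(|v| v.capacity()).sum();
    println!("after shrink_to_fit: elements allocated: {allocated}");
}
```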

But why was I using a bunch of little `Vec`s in the first place? It's because this is the "non-contiguous" NFA. This is the thing that gets built from a list of patterns by first building a trie then filling in the failure transitions. That trie construction really demands random access mutation, which puts a constraint on your data structure choices. Plus, I had designed the contiguous NFA as a release valve of sorts that lifts this constraint. So I reasoned, "sure, the non-contiguous NFA isn't designed to be fast. It just needs to get us started." But still, `pyahocorasick` only has one Aho-Corasick implementation (the `aho-corasick` crate has 3). So it must be doing something differently.

It turns out that `pyahocorasick` uses linked lists! THE HORROR! But actually, they are a really good solution to this problem. As a Rust programmer, I reach for `Vec` first. But a C programmer reaches for linked lists first. And it turns out that linked lists are a much better fit here.

And so, I switched to linked lists[5]. (But using indices instead of pointers. Because of course. This is Rust. :-)) And now `ahocorasick_rs` uses less memory than `pyahocorasick`!
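
If you haven't seen the trick before, here's a minimal sketch of what "linked lists, but with indices" means. The names and layout are mine for illustration, not the crate's actual representation:

```rust
// The idea: a singly linked list whose "pointers" are indices into one
// shared Vec, so there is one growing allocation for the whole automaton
// instead of thousands of tiny Vecs, each with its own slack capacity.

const NONE: u32 = u32::MAX;

struct Transition {
    byte: u8,
    next_state: u32,
    /// Index of the next transition belonging to the same state, or NONE.
    next_link: u32,
}

struct NfaBuilder {
    /// All transitions for all states live in this one arena.
    transitions: Vec<Transition>,
    /// heads[state] is the index of the first transition of that state.
    heads: Vec<u32>,
}

impl NfaBuilder {
    fn add_state(&mut self) -> usize {
        self.heads.push(NONE);
        self.heads.len() - 1
    }

    fn add_transition(&mut self, state: usize, byte: u8, next_state: u32) {
        let idx = self.transitions.len() as u32;
        self.transitions.push(Transition {
            byte,
            next_state,
            next_link: self.heads[state],
        });
        self.heads[state] = idx;
    }

    /// Random-access lookup during trie construction: scan this state's list.
    fn find(&self, state: usize, byte: u8) -> Option<u32> {
        let mut link = self.heads[state];
        while link != NONE {
            let t = &self.transitions[link as usize];
            if t.byte == byte {
                return Some(t.next_state);
            }
            link = t.next_link;
        }
        None
    }
}

fn main() {
    let mut nfa = NfaBuilder { transitions: Vec::new(), heads: Vec::new() };
    let root = nfa.add_state();
    let s1 = nfa.add_state() as u32;
    nfa.add_transition(root, b'a', s1);
    assert_eq!(nfa.find(root, b'a'), Some(s1));
}
```

The nice property is that every node lives in one arena `Vec`, so growth amortization (and its slack capacity) is paid once for the whole automaton instead of once per state.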

[1]: https://old.reddit.com/r/rust/comments/199jycb/identifying_r...

[2]: https://pypi.org/project/ahocorasick-rs/

[3]: https://pypi.org/project/pyahocorasick/

[4]: https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoC...

[5]: https://github.com/BurntSushi/aho-corasick/commit/31bb1fc30d...



> Just a few months ago, I ran into the same problem, although it wasn't caused by `collect()`. It was caused by "normal" `Vec` usage

If it wasn't caused by `collect()`, then I suppose it's a related problem, but not the same problem. Your problem was caused by, as your commit message says, "ignorance [original: ignorant] of the broader effects of this strategy [double-when-full] in aggregate".

The OP's problem, on the other hand, is more about reusing memory that has already been allocated when you "convert" (in Rust terms: `into_iter()`) a Vec into another Vec. What's more, the reuse happens even when the new Vec has a different element type from the original Vec's. This behavior is more surprising than the double-when-full strategy, which is more widely known by programmers (even if they sometimes "forget" about its implications).
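
For concreteness, here's a small standalone example of that surprise (the allocation reuse comes from an internal standard-library specialization rather than a documented guarantee, so the exact capacities can vary between compiler versions):

```rust
fn main() {
    // Build a big Vec: one million u32s, so roughly 4 MB of capacity.
    let big: Vec<u32> = (0..1_000_000).collect();
    println!("big:   len = {}, cap = {}", big.len(), big.capacity());

    // "Convert" it into a Vec of a different (same-sized) element type.
    // The standard library may reuse the original allocation here; when it
    // does, only four elements survive the filter but the ~4 MB of capacity
    // comes along for the ride.
    let mut small: Vec<i32> = big
        .into_iter()
        .filter(|&x| x < 4)
        .map(|x| x as i32)
        .collect();
    println!("small: len = {}, cap = {}", small.len(), small.capacity());

    // The fix, if it matters: give the excess back.
    small.shrink_to_fit();
    println!("after shrink_to_fit: cap = {}", small.capacity());
}
```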


I feel like the second sentence you quoted clarified that? "same problem" in this context means, "there is excess and unused capacity in the Vec." Your description is more precise, and at that level of precision, sure, they are not the "same problem." But they still share the same characteristic that is easy to forget: a `Vec` might have unused capacity, leading to higher memory usage than you'd otherwise expect if you aren't accounting for it.

My phrasing reflects what I think is the more universal lesson to draw from the blog, one that isn't provoked merely by how `collect()` works, but rather by the nature of growth amortization itself.

> This behavior is more surprising than the double-when-full strategy

I agree.

I updated my comment to use "similar" instead of "same."


What we need is a page-based Vec that mmaps (anonymously) for the storage but leaves the unused portions as untouched zero pages, and therefore not part of RSS, until actually required. (And when clearing/shrinking sections, madvise DONTNEED the pages.)

That is, the vec could expand to areas much larger than the actual used size, but this would have no effect on process RSS until those pages get dirtied with actual data.
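
A rough sketch of the storage layer I mean, assuming the `libc` crate and a Linux-y target (illustration only, with no safety wrapper and a hard-coded page size):

```rust
use std::ptr;

const RESERVE_BYTES: usize = 1 << 30; // reserve 1 GiB of address space up front
const PAGE: usize = 4096;

/// Reserve a large anonymous mapping. Untouched pages stay as shared zero
/// pages and don't count toward RSS until they're first written.
fn reserve() -> *mut u8 {
    let p = unsafe {
        libc::mmap(
            ptr::null_mut(),
            RESERVE_BYTES,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(p, libc::MAP_FAILED, "mmap failed");
    p as *mut u8
}

/// On shrink, tell the kernel the tail of the reservation is no longer
/// needed so those pages leave RSS.
fn release_tail(base: *mut u8, used_bytes: usize) {
    let keep = (used_bytes + PAGE - 1) & !(PAGE - 1); // round up to a page
    if keep < RESERVE_BYTES {
        let rc = unsafe {
            libc::madvise(
                base.add(keep) as *mut libc::c_void,
                RESERVE_BYTES - keep,
                libc::MADV_DONTNEED,
            )
        };
        assert_eq!(rc, 0, "madvise failed");
    }
}

fn main() {
    let base = reserve();
    // Pretend the "vec" grew to 10 MiB of real data, then shrank to 1 MiB.
    unsafe { ptr::write_bytes(base, 1, 10 * 1024 * 1024) };
    release_tail(base, 1024 * 1024);
}
```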


That can be a bad idea for subtle reasons, even though it does seem like a better design at first.

When you do the madvise, based on the contract of the API, the kernel has to do a TLB shootdown, which is really expensive. And not just “expensive for my process” but expensive in terms of a significant slowdown for the entire machine. Honestly, you could probably DDoS a machine if you could reliably trigger that shootdown.

As such you want to be very very careful where you place that data structure.

For what it’s worth, your memory allocator (or at least a good one like mimalloc or the newer, non-gperftools tcmalloc) will do exactly what you describe: free memory back to the OS using heuristics tested out at scale.

As for why a shootdown is needed: it’s so that a racing allocation in another process doesn’t get handed that memory while some core still has a stale TLB entry pointing at it (the core that did the allocation may not even be the one where it’s used).


Yes, well, I think you're right that it's a lot about the contract of the API and the expectations of the user.

I'd certainly want intelligence (either by the user or by the framework) in how frequently you release pages back to the OS.

But the DONTNEED is really not the core of the value. It's being able to create vectors that don't fragment as badly.

And yes, you're right a decent allocator could help with this, and what I'm describing is really just a kind of specialized allocator.


I don’t follow. Fragmentation wasn’t the problem here; it’s that an optimization trying to reuse the underlying allocation resulted in severe over-allocation on a specific path. That would happen even with your idea unless you use DONTNEED to release all the unused pages.

I think fragmentation gets more attention than it deserves; in practice it’s rarely the problem. It’s also a problem that can be solved by an application reset, and periodically restarting your running application in a way that doesn’t cause severe disruption is a better practice that works around many more issues than just fragmentation.


Is MADV_FREE on Linux any better in this regard? It's what the libc allocators tend to use, if I recall correctly.


It would have the same problem, because AFAIK there’s no way to free memory back to the OS without a shootdown. As I describe elsewhere, without one you can run into race conditions where another process allocating memory gets access to memory being freed by your process through a stale TLB entry.

Here’s a discussion [1] of a hypothetical lazy_munmap option that would initiate the unmap without an immediate shootdown (i.e. the CPUs would lazily evict from the TLB when they notice the request), but no such option exists. It’s also not immediately clear that such an API could exist on current CPU architectures or whether it would require new HW support; I don’t have enough hands-on expertise at that level. Anyway, [1] is an interesting discussion of the idea, and the note about huge pages making such tricks even less valuable is worth keeping in mind.

[1] https://news.ycombinator.com/item?id=23216590


Couldn't this obfuscate OOM-type problems by not triggering actual memory issues until much later than the allocation?


Maybe? Depends on how you monitor and what your expectations are. But for workloads that use a lot of memory and want to avoid wasteful empty allocations, or that have sparse data, etc., it's a reasonable strategy.


Is this not going to be more or less what the memory allocator backing Vec does under the hood anyway?


That's already what it does (except when shrinking).


Just a somewhat tangential thought: do you know of any tricks to speed up the construction phase of Aho-Corasick? A recent problem of mine needed Aho-Corasick, but with several million strings in the dictionary it took a bit longer than I expected to construct. I don't particularly care about memory.


Hmmm. The phrasing of your question is a little ambiguous. I could interpret it two ways:

1. "Do you know if there are any config knobs in the aho-corasick crate that will speed up construction?"

2. "I have my own implementation of Aho-Corasick and construction isn't as fast as I hoped. What kinds of tricks can I employ to make my own implementation faster?"

I'd be happy to try and answer either of these, but I think it would be better to ask with a concrete example in the repo's Discussions: https://github.com/BurntSushi/aho-corasick/discussions

All of the tricks I know about (and that I judged to be worth doing) are in the aho-corasick crate. There's really just no getting around the fact that you need to first build a trie (which requires visiting every byte in every pattern) and then do a breadth-first search across the entire trie to set up failure transitions. Making construction substantially faster probably requires some kind of innovation in those two parts at a fundamental level.
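
For reference, the shape of that second phase looks roughly like the sketch below. It uses a simplified dense-transition trie for illustration and isn't the crate's code, but it shows why the work is hard to avoid: every state gets popped once and all of its byte values get examined.

```rust
use std::collections::VecDeque;

const ALPHABET: usize = 256;

/// Simplified trie state: index 0 is the root, and 0 in `trans` also means
/// "no transition" (no state transitions back to the root in the trie itself).
struct State {
    trans: [u32; ALPHABET],
    fail: u32,
}

/// Phase two: breadth-first traversal that fills in failure transitions.
fn fill_failure_transitions(states: &mut [State]) {
    let mut queue: VecDeque<u32> = VecDeque::new();
    // States at depth 1 fail back to the root.
    for b in 0..ALPHABET {
        let next = states[0].trans[b];
        if next != 0 {
            states[next as usize].fail = 0;
            queue.push_back(next);
        }
    }
    while let Some(s) = queue.pop_front() {
        for b in 0..ALPHABET {
            let next = states[s as usize].trans[b];
            if next == 0 {
                continue;
            }
            // Follow failure links from `s` until some state has a
            // transition on byte `b` (the root implicitly always does).
            let mut f = states[s as usize].fail;
            while f != 0 && states[f as usize].trans[b] == 0 {
                f = states[f as usize].fail;
            }
            let fail_dest = states[f as usize].trans[b];
            states[next as usize].fail = fail_dest;
            queue.push_back(next);
        }
    }
}
```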

You can also cheat at this by designing your Aho-Corasick automaton to be zero-copy deserializable. The DFAs in regex-automata support this. I waver on whether to do this for aho-corasick, because it acts like a straitjacket for evolving the internals of the library.


Thanks for the great write up. Your writing and deep dives never fail to enlighten and entertain at the same time.


What’s the reason for not always using the contiguous variant?


You can't build the contiguous variant directly from a sequence of patterns. You need some kind of intermediate data structure to incrementally build a trie in memory. The contiguous NFA needs to know the complete picture of each state in order to compress it into memory. It makes decisions like, "if the number of transitions of this state is less than N, then use this representation" or "use the most significant N bits of the state pointer to indicate its representation." It is difficult to do this in an online fashion, and likely impossible to do without some sort of compromise. For example, you don't know how many transitions each state has until you've completed construction of the trie. But how do you build the trie if the state representation needs to know the number of transitions? Classic chicken-and-egg.

Note that the conversion from a non-contiguous NFA to a contiguous NFA is, relatively speaking, pretty cheap. The only real reason to not use a contiguous NFA is that it can't represent as many patterns as a non-contiguous NFA. (Because of the compression tricks it uses.)
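
To give a flavor of what I mean (this is an illustration of mine, not the crate's actual layout), the per-state decision looks roughly like this, and it can only be made once the state's final transition count is known:

```rust
/// Illustrative compressed state representation: pick a compact encoding per
/// state based on how many transitions it ended up with.
enum CompressedState {
    /// Few transitions: store (byte, next) pairs inline and scan linearly.
    Sparse { transitions: Vec<(u8, u32)> },
    /// Many transitions: store a dense 256-entry lookup table.
    Dense { table: Box<[u32; 256]> },
}

fn compress(transitions: &[(u8, u32)]) -> CompressedState {
    // A threshold like this can only be applied after trie construction,
    // because while the trie is still being built the count keeps changing.
    const DENSE_THRESHOLD: usize = 64;
    if transitions.len() < DENSE_THRESHOLD {
        CompressedState::Sparse { transitions: transitions.to_vec() }
    } else {
        let mut table = Box::new([0u32; 256]);
        for &(byte, next) in transitions {
            table[byte as usize] = next;
        }
        CompressedState::Dense { table }
    }
}

fn main() {
    assert!(matches!(
        compress(&[(b'a', 1), (b'b', 2)]),
        CompressedState::Sparse { .. }
    ));
}
```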

The interesting bits start here: https://github.com/BurntSushi/aho-corasick/blob/f227162f7c56...



