Recycling memory buffers in Go

stcredzero · on Aug 26, 2013

Buffer caching, the technique that never goes away.

GC does not mean your memory management is solved. It means that your memory management works well enough for development and sometimes for undemanding tasks. That said, generational GC can be like mediocre buffer caching "for free." But this still doesn't make all explicit buffer caching go away.

Given that buffer caching is "the techique that never goes away," why isn't some facility for handling this built into the runtimes of languages with GC? You could tell the GC that this particular type of object needs to be cached, with a cache of this size and growth policy, and then programmers never have to write and debug code that does this.

EDIT: Another way to think abut this: because GC is designed to be universal and foolproof, it's easy to get GC into a corner case where its behavior is very suboptimal. In the case of "memory pressure" you can think of this as the GC having to scan the graph of references way more often than it really needs to. However, the optimization techniques for many of these are fairly generic. What if optimization techniques could be built into languages and runtimes, then instrumentation could be used to empirically determine what optimizations are actually needed, then these could be simply "switched on" for particular classes or types.

fiatmoney · on Aug 26, 2013

That shouldn't usually require any explicit support from the runtime; it can exist as library code (and does - for instance Apache Commons: http://commons.apache.org/proper/commons-pool/ ).

stcredzero · on Aug 26, 2013

> That shouldn't usually require any explicit support from the runtime; it can exist as library code

Well, of course. People writing their own logically implies that such can exist as library code. But if you build in support from the runtime, then it might be possible to squeeze out a bit more performance. For example, perhaps it's possible to tell the GC that certain objects don't have to be considered as roots. (And to the extent that this is possible with current libraries, it's because there is support built into the runtime.)

EDIT: Also, having these mechanisms as library code probably opens up more opportunities for introducing implementation errors.

jamesaguilar · on Aug 26, 2013

Generic buffer caching support looks an awful lot like generational GC. It's faster when you hand-roll it because you know how you're going to use the buffers and which buffers are going to be reused.

stcredzero · on Aug 26, 2013

I'm proposing something that's not quite as foolproof and automatic as generational GC, but more optimal and a lot quicker and safer than hand-rolling. Also something for which profiling is enough to "know how you're going to use the buffers and which buffers are going to be reused."

So basically, take something that's a programming problem and reduce it to "profile, then throw some switches."

jamesaguilar · on Aug 27, 2013

I guess I'm just not seeing how that's going to result in anything different than a userspace freelist library.

stcredzero · on Aug 27, 2013

If you do everything correctly, not too much. But you can't assume the former. Also, having the thing baked into the language/runtime will encourage use. It's along the lines of what happens with refactoring facilities that aren't part of the browser/editor/ide. They stay "that esoteric thing over there" and don't get used that much.

kenshiro_o · on Aug 27, 2013

Is the Go GC Collector not good enough so you have to resort to this? Would be good if you had the option (like in Java) to specify heap min and max sizes - and if set to the same value your heap would never resize.

zwieback · on Aug 26, 2013

Maybe I didn't read the code right but where does it check whether a retrieved buffer is the right size? The makebuffer() routine appears to generate random size buffers but that length then doesn't get checked.

My experience from embedded systems is that buffer rings are great if you can predict two or three common sizes that cover your typical allocation needs.

I think even for people that work mostly in GC environments it's a good idea to read through the basic mechanism of heap management and allocation strategies. This article is a good intro to the basic idea and I like the charting and analysis being done.

jgrahamc · on Aug 26, 2013

The sample code does not check the buffer size. This is a slightly artificial example to show the technique and have some interesting graphs to back it up.

jgrahamc · on Aug 26, 2013

There's a proposal to add something like this in Go itself: https://code.google.com/p/go/issues/detail?id=4720

stock_toaster · on Aug 26, 2013

unfortunately, it looks like it isn't going anywhere right now:

https://groups.google.com/d/topic/golang-dev/QtSIOFsNFrQ/dis...

thirsteh · on Aug 26, 2013

This is nice, but I can't help but feel it's a little wrong that we're still making free-lists like in the olden days.

tptacek · on Aug 26, 2013

You absolutely don't need to do any of this to write Go programs that outperform Ruby or Python.

thirsteh · on Aug 26, 2013

Generally, sure. But it's still pretty easy to write garbage-generating Go applications that exhaust memory because the GC isn't compacting. In some applications this is a nobrainer: use a free list. In some others, like when you don't actually know what kind of buffer size you'll need, it's not so easy (although you could argue you should just fixed-sized chunks for whatever you're doing, and recycle those, but that's a lot of work.)

What I'm saying boils down to: Free lists are definitely useful, but the GC could be better.

jbooth · on Aug 26, 2013

"Could be better" is of course true for every single situation, no matter how good.

But even Java, which has what is generally regarded as the most advanced GC ever written, still has lots of applications doing the free-list of ByteBuffers or even better, the free-list of off-heap-allocated DirectByteBuffers which don't have to be moved during tenured generation compactions.

It seems pretty reasonable to me to have one or two freelists at the spots you're frequently allocating 64kb buffers, along with using the GC for the other 5,000 allocations in your project.

thirsteh · on Aug 26, 2013

Sure. No disagreement.

Let me put it another way: Go's GC is still in its infancy compared to those in most other languages. It's perfectly fine to use, but there are some big things (e.g. no compaction) that stick out when using the language, whether you're working with free-lists or something else.

tptacek · on Aug 26, 2013

I think you're going to find that even with no attention to allocation at all, your Golang programs are going to outperform their Ruby and Python equivalents.

That's a dicier assertion to make versus other languages, notably Java, but if you're coming to Golang from a high level language like Python, you're probably going to be happy with the performance you get out of naive code.

thirsteh · on Aug 26, 2013

You're right (except for some cases like crypto and regex where other languages "cheat" by using C bindings or unstable versions), but that isn't where it has bit me. In general all my Go programs have been more than fast enough, but on several occasions I've had memory starvation problems that I would not be having in other languages (even if there is a way to do it without doing a lot of allocation and fragmentation in Go.)

I love Go, but "it's fast" isn't a good reason not to improve a subpar GC. (Which fortunately isn't a position the Go team is taking. I know everyone there would like to see a lot of interesting things happen with the GC. It just hasn't been one of the more pressing issues so far, and I've agreed. It's also a very difficult and "boring" thing to work on for most people, so it's not surprising that somebody hasn't just added an Azul C4-style concurrent GC in their spare time.)

pcwalton · on Aug 27, 2013

Azul C4 requires a kernel module or a specialized kernel to support all the MMU tricks that it relies on to achieve maximum performance. I don't think it's really an appropriate choice for general purpose GC in a language that's designed to be easy to deploy.

Java HotSpot's collector makes the most sense IMHO: stop the world compacting in the young generation, mostly concurrent in the tenured generation. It achieves better throughput than Azul C4 too, according to the C4 paper.

agentS · on Aug 27, 2013

I vaguely remember reading somewhere that they figured out a way to run their collector on stock Linux; I think the price they paid in exchange was (even) lower throughput (trying to track down reference so I know I wasn't daydreaming) (yes, I daydream about GCs).

A while ago, I read through the kernel patch they provided, and it was basically adding batch APIs for mremap so you don't have to have a TLB flush for every call. Also because all mutator threads needed to be paused at a safepoint while the remaps happen, so the batch API has much shorter pause time.

lucian1900 · on Aug 27, 2013

Neither of HotSpot's collectors guarantee maximum latency, though, which is Azul C4's interesting feature.

I would pay throughput to achieve (soft) real-time GC.

stcredzero · on Aug 27, 2013

> In general all my Go programs have been more than fast enough, but on several occasions I've had memory starvation problems that I would not be having in other languages

The problem with GC's languages, is that GC is "good enough" for prototyping and for undemanding stuff, but there are still these corner cases. Really, it's no different than other forms of memory management in this regard. It's just particularly beginner friendly.

aaronblohowiak · on Aug 26, 2013

Yes, Golang is very young and its gc is one of the more 'immature' portions. However, in comparison to the tools that it is regularly displacing the overall speed is still a huge win. Eventually, Golang will get a better gc and in the meantime we can already enjoy faster programs. Nobody is arguing that the gc of today should be Golang's gc forever.

wglb · on Aug 26, 2013

Have you noticed the starvation in 64 bit programs or 32 bit programs?

thirsteh · on Aug 26, 2013

Both. They weren't issues with precision/no gc at all, but that the arena contents became fragmented, as far as I could tell.

cbhl · on Aug 26, 2013

Just to clarify, by outperform, are you speaking wrt. memory usage?

If I recall correctly, there are still some very strong warnings against running Go on memory-constrained systems (< 1 GB).

bad_user · on Aug 27, 2013

People pick Ruby and Python without having unrealistic expectations for performance and throughput. We already know their implementations suck and being the high-level languages that they are, we know how difficult they are to optimize.

However, if performance is a concern, then picking Go seems to me like quite a step back from say Ruby.

There are other alternatives as well, that are higher level and that are much better at performance than Ruby/Python. E.g. Java, C#, Scala, Clojure, Ocaml, Haskell.

Personally I love Scala and the JVM, but the one thing I don't like is the GC. The JVM has the best GCs ever, but sometimes you just want to do without one at all.

This is why I don't like Go. While it's a tasteful language, it's too low level, but it still depends on a GC (that's not precise, compacting or generational).

Mozilla's Rust is much more promising for me.

cmbaus · on Aug 26, 2013

I agree, given the type of issues they are addressing (very high throughput application servers), I don't think a general purpose allocator (garbage collected or not) will ever get the performance they want. Recycling buffers is not a bad strategy for this type of application, no matter the language.

JulianMorrison · on Aug 26, 2013

When malloc() has to deal with defragmenting allocations of varying length, I believe it does it by allocating from several pools each holding blocks of a fixed size, picking the smallest category that will fit. It wouldn't be hard to extend the buffered-channel technique to do this.

zwieback · on Aug 26, 2013

The olden days are not that old. I find myself using static arrays, free lists, malloc and its OO brethren as well as GC when available. All of those things have been around for as long as I can remember.

It's nice not to have to worry about freeing allocated memory but if you write a lot of low-level code there's a certain level of unease about letting the language or OS take on something that can have such a large performance impact.

thrownaway2424 · on Aug 26, 2013

This technique puts a lock in your serving path which will necessarily limit parallelism. Go channel operations are not "non-blocking" in every sense. Even if you have them in a select block your thread has to take the mutex to find out if the buffer is empty (on receive) or full (on send).

felixge · on Aug 26, 2013

FWIW: Node.js uses a buffer cache like this by default: https://github.com/joyent/node/blob/master/lib/buffer.js

thirsteh · on Aug 26, 2013

It's a good thing Node.js doesn't have to worry about synchronization between threads... :)

felixge · on Aug 26, 2013

Would the down-voters care to explain? I've pointed out how another platform deals with allocating raw memory buffers as it's relevant to the discussion.

I've made no claims of it being superior, and in fact, am largely using Go myself these days.

thirsteh · on Aug 26, 2013

Using a shared buffer list isn't very efficient in a language with concurrency since you have to synchronize to take from/add to the list.

cmbaus · on Aug 27, 2013

All heap access must be synchronized. Using pooled buffers may actually be faster because you can ensure that only one thread is accessing a specific pool. Just create a new pool, when a new thread is created.

thirsteh · on Aug 27, 2013

Or, you know, each thread could allocate only when it needs something and handle its own recycling. Node.js has a particular advantage in that it isn't concurrent in any way (which, of course, has many other downsides.)

bad_user · on Aug 27, 2013

You can certainly build concurrent systems on top of Node, what it doesn't have is preemptive multi-threading, which sucks IMHO, but speaking of Go I also don't like that it does M:N multi-threading instead of 1:1 like the JVM, as the scheduling done is suboptimal.

GhotiFish · on Aug 27, 2013

Oh man, I'm pretty keen to write many types of subsystems, but thread scheduling?

I'll take stuff I don't want to mess with for 300, Alex.

thirsteh · on Aug 27, 2013

> You can certainly build concurrent systems on top of Node

You sound so confident. Do you have any example of a system in JavaScript that has concurrent semantics?

wzsddtc · on Aug 26, 2013

coincidentally …. this is how we pooled the buffers on Vine for Android to make Camera initializations efficiently.