Hacker News

Very impressive. I wonder how Go manages to even stop all threads in 100 microseconds, much less do any work in that time.



Other systems I'm aware of that are capable of similar feats (I know .NET's collector can, not sure about Hotspot's) use page poisoning. Basically the compiler/JIT puts a bunch of extraneous memory accesses to a known address at key points where the GC can track what all variables and registers hold (called "safepoints"), in a fairly liberal pattern around the code. When the GC wants to stop all the threads in their tracks, it unmaps the page holding that address, causing a segfault in every thread when it hits one of the safepoints. The GC hooks the segfault handler and simply waits for all threads to stop.

I'm not sure that's what Go does, but I'm not aware of a faster way on standard hardware.


That's basically what Go does; it folds the preemption points into the stack checks during function prologs.


Where did you learn that? (curious to learn more)


I haven't seen it written up, but it's discussed some here:

  - https://github.com/golang/go/issues/10958


hopefully in back branches too, right?


Not according to this bug. :( https://github.com/golang/go/issues/10958


Java 7 and above provide the -XX:MaxGCPauseMillis flag for GC configuration.

From Mechanical Sympathy Blog [1]

" G1 is target driven on latency –XX:MaxGCPauseMillis=<n>, default value = 200ms. The target will influence the amount of work done on each cycle on a best-efforts only basis. Setting targets in tens of milliseconds is mostly futile, and as of this writing targeting tens of milliseconds has not been a focus of G1. "

So in the Java world we are talking hundreds of milliseconds worst case, which is three orders of magnitude higher than Go.

1. http://mechanical-sympathy.blogspot.com/2013/07/java-garbage...


Well, yeah. You're comparing the G1 (for server workloads) to Go's concurrent collector. There's a concurrent collector you can use in incremental mode for the JVM if you want to trade throughput for pause time, like Go does.

The HotSpot designers have (correctly, IMO) observed that server use cases typically prefer throughput to latency.

On a meta note, I wish people would focus on the actual algorithms instead of taking cited performance numbers out of context and comparing them head to head. The Go GC uses the same concurrent marking and sweeping techniques that CMS does. This change to avoid stop the world stack sweeps is something that no HotSpot collector does. But it's counterbalanced by the fact that Go's GC is not generational. Generational GCs change the balance significantly by dramatically improving throughput.


In my experience you are still getting single-to-double-digit millisecond pauses on average using CMS (even with some attention/tuning). Do you really think Hotspot can offer a GC where pauses longer than 100us are considered interesting enough to look into?


Sure, if they implement barriers for stack references like Go is. But that's a significant throughput hit.


My original intention was to ask if you think that's currently achievable. But also interesting for the future.


Well, probably not.

The question is what use cases can tolerate large throughput hits but not pause times of a few msec (G1 can also do pauses of a few msec in many cases; I see hundred-msec pause times on very large heaps, but not desktop-sized heaps).

I suspect there are very few use cases. The G1 team seems to be focusing on scaling to ever larger heaps right now, like hundreds of gigabytes in size. They're relatively uninterested in driving down pause times as Shenandoah is going to provide that for the open source world and Azul already does for the $$$ world.


Unless you have numbers to share for CMS latency in Java, I have no reason to assume that they are materially different from G1. I am using CMS for my server applications and multi-second STW pauses are quite common with CMS.

In general the Oracle JDK uses an order of magnitude more memory and orders of magnitude higher GC latency compared to Go. IMO that is quite useful to remember when deciding on a language for new projects. If the Hotspot people think that these numbers are not true I am sure they would let people know.


* Heap size and stack sizes × number of threads matter. JVMs can manage hundreds of OS threads with deep stacks and 100GB+ heaps.

* Go's GC is completely non-compacting, while G1 compacts all generations and CMS all but the old generation.

* Shenandoah will offer lower pause times by doing concurrent compacting, at the expense of throughput.

* Azul already offers a pauseless compacting collector for Hotspot.


Here are Shenandoah pause numbers [1]. Max pause is 45ms, which I would agree is ultra low pause for Java, since pauses of hundreds of ms are common for Java server applications. Here are GC numbers for Go >= 1.6 on 100GB+ heaps [2]. Go 1.7/1.8 have or will have lower numbers than that.

1. http://www.slideshare.net/RedHatDevelopers/shenandoah-gc-jav...

2. https://talks.golang.org/2016/state-of-go.slide#37


A big difference is that Shenandoah still does compacting while gogc does not. That also means it can do bump-pointer allocation, i.e. it has a faster allocation path.


There must be some end-user (positive) impact of memory compaction in Java. But I do not see it in benchmarks, where Java programs take 2-10 times more memory and run slower than Go.

http://benchmarksgame.alioth.debian.org/u64q/go.html


The end-user benefit is stability. A runtime that compacts the heap cannot ever die due to an OOM situation caused by heap fragmentation, whereas if you don't compact then the performance of an app can degrade over time and eventually the program may terminate unexpectedly.


Those benchmarks are of limited use since the JVM has startup costs which get amortized over time. Full performance is only achieved after JIT warmup. AOT would improve that, but on OpenJDK it's experimental and few people use the other JVMs that support AOT out of the box.

About the memory footprint: The runtime is larger, so there's a baseline cost you have to pay for the JITs. But baseline != linear cost.

If you have long-running applications that run for more than a few seconds and crunch more than a few MB of data, those numbers would change.

So unless you're using the JVM for one-shot script execution those benchmarks are not representative. And if you do then there would be certain optimizations that one could apply.

> There must be some end user (positive) impact of memory compaction in Java.

Fewer CPU cycles of overhead per unit of work, i.e. better throughput, which does seem to be an issue for gogc [0]. No risk of quasi-leaks in the allocator due to fragmentation. Reduced cache misses [1].

[0] https://github.com/golang/go/issues/14161 [1] http://stackoverflow.com/q/31225252/1362755


> About the memory footprint: The runtime is larger, so there's a baseline cost you have to pay for the JITs. But baseline != linear cost.

I do not think so. Here is an explanation of Java memory bloat in typically used data structures:

https://www.cs.virginia.edu/kim/publicity/pldi09tutorials/me...


That is true - although typed arrays will solve that in the future - but the benchmarks you cite (e.g. fannkuch-redux) still run into a memory floor created by the larger runtime.

Also, `regex-dna` will probably benefit from compact strings in java 9.

And some of the longer-running benchmarks are in fact faster on java, so I think that supports my point about warmup.


> (e.g. fannkuch-redux) still run into a memory floor

Yes, apart from reverse-complement, regex-dna, k-nucleotide, binary-trees.

> Also, `regex-dna` will probably benefit from compact strings in java 9.

Do you actually know that com.basistech will re-write their library to use compact strings?

> I think that supports my point

I think that's called cherry picking.


> Those benchmarks are of limited use since the JVM has startup costs which get amortized over time.

-- Benchmarks are of limited use.

-- Are JVM startup costs significant?

http://benchmarksgame.alioth.debian.org/sometimes-people-jus...


The binary-trees benchmark is much faster on Java than on Go.


> faster on Java than on Go

go version go1.7 linux/amd64

OP "Sub-millisecond GC pauses in Go 1.8"


Interestingly enough, .NET doesn't use the "memory access" trick; instead it uses either a direct poll against a variable, or for user code it can sometimes just rudely abort the running thread (if it knows it's safe to do so).

See http://mattwarren.org/2016/08/08/GC-Pauses-and-Safe-Points/ for all the gory details


Hotspot does that as well. Really nifty stuff.


Hotspot's worst-case (non-)guarantees are hundreds of milliseconds, not hundreds of microseconds.


Again, you're comparing the server G1 GC (tuned for throughput) to Go's GC (tuned toward extreme latency guarantees).


Is this a bad thing? You can just increase the number of front-end servers and have high throughput in a service with reliable tail latency.


This. It's a lot easier to horizontally scale things with a lean towards consistently lower operational latency. You can keep raking in the benefits and cranking up throughput without a whole lot of thought.

It's much more expensive and complex to take an erratic latency operation and bring it down by throwing on more resources. As far as I can tell, the normal design course is making sure all your major actions are either pure or idempotent allowing parallel (and redundant!) requests to be made... which is a large (worthy, but large) engineering effort, and then we're talking about scaling to 2x or more just so you can make that redundant-request thing your default behavior.


In case of Java there is also a JVM implementation that does pauseless GC (zing): https://www.azul.com/products/zing/pgc/ .

For OpenJDK/Hotspot there will be a <100ms GC in Java 9: http://openjdk.java.net/jeps/189


'Pauseless' is a marketing term, like those unlimited data plans from telecom/cable providers.

Here is how Zing use flags:

-XX:+UseGenPauselessGC - Generational Pauseless GC (GPGC) optimized for minimal pause times.


You're right, Zing pause times are "only" uncorrelated with heap size or usage patterns.


Another approach you can use in some cases with the JVM, which is often the simplest, is to set up the JVM so it doesn't GC (give it a lot of memory), then either just spawn a new JVM to take over, or take the machine about to run a GC out of your load-balanced pool before running a full GC, then put it back in again.

Doing the manually triggered & staggered GC trick on a pool of machines you control can give you very low latency guarantees, since no production request will ever hit a GC-ing JVM.


"just"

You can also "just" remove servers that are currently in GC from your pool, and have high throughput in a service with reliable latency.


How would you do that?



Personally I found Java's GC to be a little tricky but generally awesome.

However, 100, even 200ms pauses get "lost" during network comms. But users tend to notice if a page takes 30 seconds to load rather than 1 second (differences I've seen optimising Java's GC; I don't know at all how Go and Java compare).

That kind of difference would be very expensive to solve using more servers.


The biggest issue is on single deployments like desktop or embedded, hence why these kind of improvements are so relevant.


That's the key trade-off these days in software, imho: what kind of memory management you need.

Server side, zero memory leaks is an absolute requirement and "real time" responses are rarely a requirement; triggering GC during init and/or shutdown of a component is often enough.

Building a CNC machine - every tick is valuable, but as long as it can run a job for 2 days before crashing no one will notice if it leaks memory like a sieve when you run a calibration routine.


From my knowledge, G1 is for low latency, sacrificing some throughput. I do not know why you keep hammering a point that Oracle's official documents do not claim.

https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc...

Here are relevant points:

1. If (a) peak application performance is the first priority and (b) there are no pause time requirements or pauses of 1 second or longer are acceptable, then let the VM select the collector, or select the parallel collector with -XX:+UseParallelGC.

2. If response time is more important than overall throughput and garbage collection pauses must be kept shorter than approximately 1 second, then select the concurrent collector with -XX:+UseConcMarkSweepGC or -XX:+UseG1GC.


That's relative to other Hotspot collectors. It's still generational with bump-pointer allocation and lighter write barriers than Go, so it is still geared heavily towards throughput relative to Go's GC.


Sorry, I was referring to the implementation of safepoints, not GC latency.


To stop the world, Go preempts at the next stack size check [1], which happens during function prologues. I guess that happens often enough to be fast in practice.

[1]: https://github.com/golang/go/blob/8f81dfe8b47e975b90bb4a2f8d...


I assume this works so quickly in part because goroutines are backed by a pool of OS threads that don't block? So everybody doesn't get stuck waiting for a thread that's blocked to get around to checking the preemption flag?


Right, blocked threads are stopped in the userspace scheduler so there are no synchronization issues with just keeping them stopped.


> (...) much less do any work in that time

100 microseconds is quite a long time in CPU terms for a single core these days, and proportionally longer with multiple cores, or say in GPU time. Even taking the VM runtime environment into account, though, that doesn't make Go's feat any less impressive.


The 100 microseconds is the length of the pause after all threads are stopped.



