In my experience, when using Django or one of the other WSGI-based Python web frameworks, the steps to complete a complex request are serialized just as much as in Rails. The single-threaded process-per-request model, based on the hope that requests will finish fast, is also quite common in Python land.
You mention that GC is a nightmare for high-percentile latency. Isn't this just as much of a problem for Go? Would you continue to develop back-end services in C++ if not for the fact that most developers these days aren't comfortable with C++ and manual memory management?
For my own project, the GC tradeoff with Go (or Java) is acceptable given the relative ease of development w.r.t. C++. Since there are better facilities for explicitly controlling the layout of memory, you can do things with free pools, etc., that take pressure off the GC.
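To make that concrete, here's roughly the kind of free pool I have in mind, as a minimal Go sketch (the bytes.Buffer payload and the handleRequest shape are made up for illustration; sync.Pool is just the stock way to recycle allocations instead of handing the GC a fresh object on every request):

    package main

    import (
        "bytes"
        "fmt"
        "sync"
    )

    // A free pool of reusable buffers. Pulling a buffer from the pool
    // instead of allocating a new one per request keeps allocation churn,
    // and therefore GC pressure, down on hot paths.
    var bufPool = sync.Pool{
        New: func() any { return new(bytes.Buffer) },
    }

    func handleRequest(payload []byte) string {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset() // reuse the storage left over from a previous request
        defer bufPool.Put(buf)

        buf.Write(payload)
        return buf.String()
    }

    func main() {
        fmt.Println(handleRequest([]byte("hello")))
    }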
For high-performance things like the systems I had to build at Google, I don't know how to make things work in the high percentiles without bringing explicit memory management into the picture. Although it makes me feel like a hax0r to talk about doing work like that, the reality is that it adds 50% to dev time, and I think Go/Clojure/Scala/Java are an acceptable compromise in the meantime.
It is possible to build things that minimize GC churn in python/ruby/etc, of course; I don't want to imply that I'm claiming otherwise. But the GC ends up being slower in practice for any number of reasons. I'm not sure whether that's still true of javascript, actually... it would be good to get measurements for that; I bet it's improved a lot since javascript VMs have received so much attention in recent years.
Final point: regardless of the language, splitting services out behind clean protobuf/thrift/etc APIs is advantageous for lots of obvious reasons, but one of them is that, when one realizes that sub-service X is the memory hog, one can reimplement that one service in C++ (or similar) without touching anything else. And I guess that's my fantasy for how things will play out for my own stuff. Ask me how it went in a couple of years :)
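As a sketch of what that boundary buys you (names like Indexer and Lookup are invented for the example, and in practice the interface would come out of a .proto/.thrift definition rather than being hand-written in Go):

    package main

    import (
        "context"
        "fmt"
    )

    // Indexer is the service boundary that callers program against.
    type Indexer interface {
        Lookup(ctx context.Context, key string) (string, error)
    }

    // inProcessIndexer is the initial, all-Go implementation.
    type inProcessIndexer struct{ data map[string]string }

    func (i *inProcessIndexer) Lookup(_ context.Context, key string) (string, error) {
        v, ok := i.data[key]
        if !ok {
            return "", fmt.Errorf("not found: %q", key)
        }
        return v, nil
    }

    func main() {
        // If profiling later shows this service is the memory hog, the value
        // behind the interface becomes an RPC client pointing at a C++
        // reimplementation, and none of the callers change.
        var idx Indexer = &inProcessIndexer{data: map[string]string{"a": "1"}}
        v, err := idx.Lookup(context.Background(), "a")
        fmt.Println(v, err)
    }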
Just to be clear, do you mean that writing in C++ and doing manual memory management doubles dev time, or that it makes development 1.5 times as long as it would be in a garbage-collected language?
Also, where does most of that extra dev time go? Being careful while coding to make sure you're managing memory right, or debugging when problems come up?
I don't think that doing manual memory management doubles dev time for experienced devs, no... I just mean that, if you're trying to eliminate GC hiccups by, say, writing a custom allocator in C++ (which is exactly what we had to do on the project I was on at Google), it just adds up.
In other words, it's not the manual memory management that's expensive per se; it's that manual memory management opens up optimization paths that, while worthwhile for an appropriately latency-sensitive system, take a long time to walk.
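For a flavor of what those paths look like, here's a toy bump-arena sketch; it's in Go just to keep it short (the actual work was a custom C++ allocator), and a real one has to handle growth, alignment, and ownership, which is where the extra time goes:

    package main

    import "fmt"

    // arena is a crude bump allocator: one slab allocated up front, handed
    // out in slices, and reset wholesale between requests, so the GC sees a
    // single long-lived allocation instead of many short-lived ones.
    type arena struct {
        buf []byte
        off int
    }

    func newArena(size int) *arena { return &arena{buf: make([]byte, size)} }

    // alloc returns n bytes from the slab, or nil when the slab is exhausted.
    func (a *arena) alloc(n int) []byte {
        if a.off+n > len(a.buf) {
            return nil // a real allocator would grow or fall back to the heap here
        }
        b := a.buf[a.off : a.off+n]
        a.off += n
        return b
    }

    // reset reclaims the whole slab in O(1); everything handed out is invalidated.
    func (a *arena) reset() { a.off = 0 }

    func main() {
        a := newArena(1 << 20)
        for req := 0; req < 3; req++ {
            scratch := a.alloc(64)
            n := copy(scratch, fmt.Sprintf("request %d", req))
            fmt.Println(string(scratch[:n]))
            a.reset()
        }
    }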