One thing to keep in mind is that their workload is very read-heavy, which eases things a lot when scaling the system. The same is true for Wikipedia. Scaling a write-heavy workload is far more complex than scaling a read-heavy one.
SO content changes all the time. Votes, comments, moderation, edits, tagging, search and recommendations, etc. There are also real-time community features. It's not as simple as it seems.
They log every single pageview to SQL Server. There are plenty of writes to various other counters, recommendation system, message inboxes, analytics, etc. Also the ads system, although I'm not sure how much of that is still 3rd-party.
Yes it's read-heavy but there's still plenty of work done in assembling a page. It's definitely not as simple as caching at the CDN edge for every hit.
IIRC most of the "real time" stuff is done on Redis, for which this QPS load is usually a joke (depending on requests of course, but counters etc are easy)
So, would you mind naming any non-FAANG but write-heavy site? Besides that, even the operation of most social networks should be fairly easy to split into town-sized instances with batch updates between instances (see the Mastodon specs https://github.com/tootsuite/documentation/blob/master/Runni...). What actually costs a lot is constantly surveilling all user interactions, warehousing that data, and playing out ads. None of that is necessary, and 20ct per user per month should cover the actual service nicely...
All that can be implemented as separate services. Doing that also enables graceful degradation, as a failure of, say, voting doesn't prevent me from reading the question.
A slim subset of humans typing things is not what I would label "write heavy". Write heavy is more like 100k+ devices out in the field sending their current position every 10s. That's still very manageable, but requires some thoughtful design.
Indeed, and I literally was. And the back end was written in mod_perl (which isn’t exactly known for performance). Even back then 10k/s was our smallest service - both in income and in traffic - so there wasn’t the incentive to rewrite it in a faster language.
The amount of data transferred is meaningless to compare. SO is a relatively complex site with dynamic content and real-time features.
Compared to similar sites like Reddit or Quora, which are far slower yet run on more hardware, it shows what properly efficient architecture and code can do.
Sure, Reddit has more write activity, but I don't see the number of replies being such a big factor. SO questions have answers, comments, tags, related questions, and lots of other secondary data to load. Reddit is slow due to poor architecture and a terrible frontend.
I recall that back when the new Reddit was rolled out, there were quite a few days when it was difficult to open in a mobile browser. Now it's much better.
SO is really not a great example of a high traffic site. 5500 req/sec of mostly read only traffic is not that crazy at all, and their hardware footprint is incredibly over-provisioned for the workload. I don't really think their example stands well here.
For example, at work, our entire analytics ingest workload (HTTP) for a few hundred million users runs on 8 core VMs on GCP, written in Rust/Go, each node doing ~40k events/second.
Our RTC infra sustains >1M PPS per node on 4-core 3.9GHz 2014-2017 Xeons.
Hi,
I'm curious because I plan to rewrite a Rust service in Go (development velocity is too slow).
Which part of your service is in Rust and which in Go?
Do you think that if it would be only in Go it could sustain such a load?
What specific ergonomics pitfalls did you experience in Rust? "Development velocity is too slow" is a bit vague; beyond the use of GC, which only really matters in specialized domains, there's not much reason to think that rewriting your service in Go would give you better 'development velocity'.
There is, though. Go was designed to be easier to reason with than conventional ALGOL-derivatives, while Rust wasn't.
To quote Rob Pike:
The key point here is our programmers are Googlers, they’re not researchers. They’re typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They’re not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt.
Go is aiming for the same niche that Python is. This has the accidental side effect of making it faster to write software in it than most of the ALGOL-derivatives. To quote Eric Raymond (sorry) on Python:
When you're writing working code nearly as fast as you can type and your misstep rate is near zero, it generally means you've achieved mastery of the language. But that didn't make sense, because it was still day one and I was regularly pausing to look up new language and library features!
Go and Python both allow for expression about as fast as you can type, by virtue of being designed for [children in Go's case, shell scripting in Python's case].
It's not a value judgement of Rust or anything, but Rust wasn't designed with the same goals in mind.
> Go was designed to be easier to reason with than conventional ALGOL-derivatives, while Rust wasn't.
This may be a problem of what's idiomatic in each language, as opposed to a matter of language design per se. After all, Rust development can be made at least as easy as, e.g. Swift, simply by adding enough uses of .clone() and RefCell<>. Is this suboptimal? Of course, but it will still be plenty faster than Python, and perhaps even faster than Go.
Compilation time is a separate issue which apparently OP found problematic. It's being dealt with (for non-release optimized builds) via the cranelift project, which is a Rust-specific backend much like the Go compiler, with no reliance on LLVM.
> This may be a problem of what's idiomatic in each language, as opposed to a matter of language design per se.
It's absolutely a matter of language design. Again, I'm not criticizing Rust, but a language explicitly designed so that anyone can write it trivially and fast is going to be quicker for anyone to write in. Imagine making that comment about Logo instead of Go. Of course Logo is more painless to write than Rust! It's a child's language. So is Go!
Rust development can't be made as easy as Swift (and it is very unlikely that what you described would be faster than Go). Even Graydon Hoare agrees that Rust is inadequate in comparison to Swift in terms of development ease:
I'm no stranger to languages with vastly different idioms than normal (I write APL daily), but not all languages are as quick to develop in as every other, and pretending they are is silly. Rust has some innovations, and it's by no means a bad language in itself, but pretending it wins at everything under the sun (even things it's not trying to do) doesn't reflect reality or the perspective of the original author.
If you can write a ton of code without thinking very much, you're probably writing boilerplate that should have been generated from a human-level description of the problem. Your job is to only spend time writing what needs to be written.
Languages that allow you to write as fast as you think are a blessing.
"Write as fast as you think" is a far better way to program than "Write much slower than you can think."
Eric Raymond has written some pretty substantial things, and he's not as clueless (on programming, at least, the rest of his views are...no) as you're implying.
The idea that intuitive languages are the only ones you should do development in is absurd. A single line of K can do what a hundred lines of C can, and you can write the line of K substantially faster than you could write the C to match. K only has something like 50 primitives. It's simple enough that you can keep it all in your head at once, and that allows you to develop much quicker than almost any ALGOL-derivative. Taking your comment at face value, everything written must be boilerplate. Looking at reality paints a different picture.
Good languages manage complexity in a way that allow you to express complex things in simple terms. That the languages you seem to be familiar with only allow you to describe simple things in simple terms isn't something that's inherent to every programming language. I'd recommend giving APL, J, or K a try.
I'm all in favor of concise and expressive languages (even weird ones). I hate being slowed down by the language itself. But writing the first thing that comes into my head leads me to reinventing the wheel a lot, and (at least at work) we have a responsibility to find reusable abstractions and only create new code that's needed.
I disagree with that. If your code base is small enough, a few redundant lines (idioms) don't matter.
To bring up k again, the language has already done about the maximum amount of abstraction possible. There's really no room for the programmer to make reusable abstractions; any useful ones have already been made. That allows you to do very useful things in very small amounts of code. Picking a random example, here's a complete Sudoku solver in 75 bytes:
By development velocity I mean implementing new features, from idea to deployment.
Rust is slow to compile so it breaks my deep work when programming, also it costs me a lot in CI/CD.
Also, the Rust type system makes implementing some things really hard.
For example, I wanted to implement JSON request logging. It took me more than a day in Rust, and less than two hours in Go.
Any reason for such a low (<5%) average CPU usage? It seems like a waste of resources to me; that's assuming a "normal" CPU usage reading that includes I/O wait time.
Stack Overflow hosts all (most?) of their own bare-metal servers in their own data center.
Looking at the specs of the machines, they are actually pretty basic as far as servers go. A server is barely worth the cost of its chassis and motherboard if you put less than 64 GB of RAM and 24 CPU cores in it.
In other words, these are about the lowest-spec'd proper servers you can get. So yeah, even their modest hardware is still over-spec'd for running their website.
"Their own data center" implies (to me at least) that they built their own data center. That doesn't sound right, so I looked it up, and it seems like they're colocating. That might be what you meant, but the data centers they use certainly weren't built or owned by SO.
As CPU utilization goes up, latency will go up. Also, one or two second snapshots of CPU use are averages over time. During any given moment you will invariably get little bursts of requests, and those will have slow responses if running closer to full utilization.
Also, of course, the total cost per hour for 9 servers with 3 year depreciation is in cents.
They have a lot more CPU than they need, but the challenge is often getting servers shaped appropriately (RAM x CPU x disk x network), especially if you're not building your own.
Ultimately, their service is (most likely) not compute-bound.
Yes but the overall hardware is a one-time purchase and cheap considering all the other costs of the business. They're well provisioned to keep latency down and handle any unexpected outages.
600k open WebSocket connections means that 600k people opened Stack Overflow pages but don't click on anything, right?
Because they still only have 550 req/s. Interesting how much more power you need just to keep track of state.
Many developers severely underestimate how much workload can be served by a single modern server and high-quality C++ systems code. I've scaled distributed workloads 10x by moving them to a single server and a different software architecture more suited for scale-up, dramatically reducing system complexity as a bonus. The number of compute workloads I see that actually need scale-out is vanishingly small even in industries known for their data intensity. You can often serve millions of requests per second from a single server even when most of your data model resides on disk.
We've become so accustomed to extremely inefficient software systems that we've lost all perspective on what is possible.
Can you expand on this? I have some pretty massive compute loads that need to be scaled onto a cluster with 100+ workers for most computations. This is after I use a library called dask that does its own graph-based map-reduce optimisation inside its modules. This is all for a relatively small 250GB raw data file that I keep in a CSV (and need to convert to SQL at some point).
Are you saying this can be optimised to fit inside a single 10 core server in terms of compute loads?
Don't know why you're being downvoted but I'll assume your question is genuine.
You use a cluster when your data and compute requirements are large and parallel enough that it's worth paying the network-latency tax and giving up the 10-20x speedup you get from SSD and the 1000x speedup you get from just keeping data in RAM.
250 gigs is tiny enough that you could probably get much better performance running on a high-memory instance in AWS or GCP. You'll generally have to write your own multiprocessing code, though, which is fairly simple - your existing library may also be able to support it.
I once actually ran this kind of workload on just my laptop using a compiled language that performed better than pyspark on a cluster.
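For what it's worth, a rough sketch of the "write your own multiprocessing code" route, assuming the big CSV has first been split into smaller per-piece files (the paths and the per-file computation here are placeholders, not anything from the thread):

```python
from glob import glob
from multiprocessing import Pool

import pandas as pd


def process_one(path):
    # Stand-in per-file computation; the real work would go here.
    df = pd.read_csv(path)
    return path, len(df)


if __name__ == "__main__":
    # Hypothetical pre-split pieces of the 250GB file, one CSV per piece.
    files = sorted(glob("parts/*.csv"))
    with Pool(processes=10) as pool:  # roughly one worker per core
        for path, n_rows in pool.imap_unordered(process_one, files):
            print(path, n_rows)
```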
I'd love to keep it in RAM if I could. The problem is, the library I'm familiar with (pandas) typically seems to take more memory than the original CSV file once it loads it into memory. I know this is due to bad data types, but in certain cases I cannot get around those.
However, even if I could load it all into memory at once, and assuming it takes 200 GB, I'm still using a master's student's access to a cluster. So I get preempted like it's nobody's business. Hence I prefer a smaller memory footprint, even if I take up CPUs at variable rates through a single execution.
I did try to write my own multiprocessing code for this, but the operations are sometimes too complicated (like groupby) for me to rewrite everything from the ground up. If I'm not reliant on serial data communication between processes (like you'd need to sort a column), I can get it done pretty easily. In fact, I wrote my data cleaning code with this and cleaned up the entire file in half an hour because single chunks didn't rely on others.
However, if you have some idea of how to run these computational loads in parallel in python or any other language on single compute instances (like the size of a laptop's memory of 16 gb), I'd really love to see it. Thanks.
Numpy supports memory mapping `ndarrays` which can back a DataFrame in pandas. This lets you access a dataset far larger than will fit in RAM as if it lived in RAM. Provided it's on fast SSD storage you'll have speedy access to the data and can process huge chunks at once.
Can you provide a link to this please? My current knowledge is that all numpy data lives in memory, and pandas itself has a feature to fragment any data into iterables so I can read up to my memory limit. I cannot use this feature due to the serial nature of some of the operations I alluded to (I'd have to almost rewrite the entire library for some of the more complicated operations like groupby and sorting).
I do have fast SSD storage because it's on the scratch drive of a cluster and from what I've seen it can do ~300-400 MB/s easily. I haven't had a chance to test more than that since I'm mostly memory constrained in much of my testing.
My current attempt is to push this data into a pure database system like SQL so that I can query it. But like I said, I work with a less-than-stellar set of tools, and I have to literally set up a Postgres server from the ground up to write to it. Which shouldn't be a big deal, except it's on a non-root user and I have to keep remapping dependencies (it took 5-6 hours to set up on the instance I have access to).
My other option was to write the entire 250 GB to an SQLite database using the SQLAlchemy library in Python, but that seems to fail whether I do it with parallel writes or serial writes. In both cases, it fails after I create ~64-70 tables.
You can create memory mapped ndarrays, these act like normal numpy arrays but don't need to fit into RAM. Numpy maps the array to a binary file on disk. The array otherwise acts like an ndarray so you can build a DataFrame with it. Whenever you access an array index Numpy in the background (essentially) seeks that many values into the file to grab the value of that index.
Since you're on a fast SSD and Numpy is fairly smart you'll be able to access your arrays close to your drive's speed. It's slower than if the whole database was in RAM but far faster than distributing the data over a network to a bunch of worker nodes. Memory mapped files let you have array-like access to data on disk as if it lived in RAM. When building a pandas DataFrame from a memmapped ndarray I believe you just need to set copy=False in the constructor for it to Just Work.
I don't know what your data looks like but I doubt loading it into SQLite is going to improve your performance.
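A minimal sketch of that memmap-backed approach, assuming the CSV has already been dumped once into a flat binary file of float64 values (the file name, shape, and dtype are assumptions):

```python
import numpy as np
import pandas as pd

# Assumed layout: n_rows x n_cols of float64 written contiguously to disk.
n_rows, n_cols = 100_000_000, 8
mm = np.memmap("ticks.f64", dtype="float64", mode="r", shape=(n_rows, n_cols))

# Walk the data in large slices; each slice is paged in from the SSD on
# demand, so resident memory stays around one slice at a time.
chunk = 5_000_000
col_sums = np.zeros(n_cols)
for start in range(0, n_rows, chunk):
    col_sums += np.asarray(mm[start:start + chunk]).sum(axis=0)

# A DataFrame over one slice; copy=False asks pandas not to duplicate the block.
df = pd.DataFrame(mm[:chunk], columns=[f"c{i}" for i in range(n_cols)], copy=False)
print(col_sums, df.shape)
```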
Unfortunately my data isn't all numbers. It has text too. The sparse examples in that link only show this for reading in numbers. Do you know off hand if it translates well? There is a dtype parameter, but it'll take me a few days to get back to this code, so I figured I'd check beforehand.
Your second paragraph is essentially what I want. I'm willing to wait a day for code that may run in 1 hour from memory, so time isn't entirely an issue unless it's starting to bleed into weeks. The read_csv function in pandas has a parameter called memory_map, but when I tried using it on a smaller 7GB dataset, it read the whole thing into memory (32GB instance) even when I set it to True.
SQLite is definitely not my best option here. It was the only server-less implementation I could find, so I tried to use it and it didn't work. However, a database-like implementation will be helpful because each operation I need to do requires data that satisfies certain timestamp and arithmetic conditions. I figured it'd be best to load the whole thing into a DB and query it for every operation to train my model.
I'd spin up something like an AWS r5.16xlarge node just for the processing and shut it down after use - it should cost a few tens of dollars per run or so. Of course, in some corporate environments this option may not be available to you.
> Built based on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
> cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.
... re: the OT: While it's possible to write C++ code that's really fast, it's generally inflexible, expensive to develop, and dangerous to write for devs whose experience lies in their own respective domains. Much saner to put a Python API on top and optimize that during compilation.
Unfortunately, I'm not sure what's wrong with dask, but it doesn't work properly on my cluster. I tested it on an exceedingly simple operation - find all unique values in a very big column (5 billion rows, but I know for a fact that there are only 500-502 unique values in there). With 100 workers, it still failed. Now, this is an embarrassingly parallel operation that can be implemented trivially. So I'm not sure if there's a problem with my cluster or if dask just does not work with Slurm clusters very well.
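For comparison, that particular job can also be done in a single process by scanning only the relevant column in bounded chunks; a minimal sketch, with the file and column names assumed:

```python
import pandas as pd

# Accumulate distinct values of one column without ever holding the
# 5-billion-row column in memory at once.
uniques = set()
for chunk in pd.read_csv("big.csv", usecols=["symbol"], chunksize=10_000_000):
    uniques.update(chunk["symbol"].unique())

print(len(uniques))  # expected to land around the ~500 known values
```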
> You use a cluster when your data and compute requirements are large and parallel enough that it's worth paying the network-latency tax and giving up the 10-20x speedup you get from SSD and the 1000x speedup you get from just keeping data in RAM.
Nowadays there's another reason to use clusters: to autoscale your expenditure wrt workload. A little inefficiency might be acceptable if you don't have to pay for a huge beefed up server idling at, say, 30%.
note: doing anything on a 250gb file in python will require a lot more ram than 250gb. generally my expectation is I will need 10x the ram as the size of the file when using pandas, for when I accidentally do something that triggers pathological behavior.
You have 250GB of "raw" data stored in CSV format. The parsed version of this data in memory is likely to be a fraction of the on-disk size. A `long` or `double` only take up eight bytes in memory but 10-20 bytes on disk stored as ASCII in a CSV file. Even if your raw data was 250GB you could store it in memory mapped files. A fast SSD can easily hit a gigabyte per second sequential read speed, far faster than your typical network.
Segmenting your raw data and using memory mapped files will let you work with large data sets without needing huge amounts of RAM. From there it's a question of your single system's processing speed and IO capacity. This is only necessary if your processing needs random access to the entire dataset.
If your CSV data is more like a streaming data source, you're processing each record as it's read in, you can just stream it in through `stdin`. At 1GB/s you're looking at five minutes or so to process your 250GB of raw data. A SATA SSD might take twenty minutes to stream that raw data.
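For illustration, a minimal streaming sketch of that pattern (the presence of a header row and which column to aggregate are assumptions):

```python
import csv
import sys

# Fold each record into a running aggregate as it streams in, so memory use
# stays constant regardless of file size.
# Hypothetical usage:  cat ticks.csv | python aggregate.py
reader = csv.reader(sys.stdin)
next(reader)                   # skip the header row (assumed to exist)

rows, total = 0, 0.0
for record in reader:
    rows += 1
    total += float(record[2])  # which column holds the number is an assumption

print(f"rows={rows} sum={total}")
```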
> A fast SSD can easily hit a gigabyte per second sequential read speed, far faster than your typical network.
It's important to note that often your disks aren't directly attached to your compute. That's frequently the case in (particularly cheap) cloud instances.
That's the thing, you don't necessarily need a bunch of cloud instances to process data. If you must be doing everything in "the cloud" every service has dedicated instances with fast attached storage available. You can spin up one long enough to rip through your data instead of trying to distribute it over hundreds of workers.
It's also a domain where you can buy an off-the-shelf desktop for a few hundred dollars to do the work. That's the thrust of this whole thread, because scalable "cloud" systems exist and look cheap people obsess about throwing more instances at problems.
Modern commodity systems are ridiculously powerful and far more capable than people tend to assume. Even "the cloud" gets underestimated because people look at the low end cheap instances and assume they need to spin up hundreds of those when one beefy image for a short duration could do the same work.
so, a hundred of the cheapest cloud instances cost you what? 1000 USD per month? You can easily get a deskside workstation for 10000 USD with a nice 2TB NVME-SSD, a 32-Core Threadripper and 128GB of memory... Utilization might be lower of course, but even if it's only utilized 8h/day this seems like a bargain for any solid business. For any startup aiming for exponential growth by burning through cash and having a 15month "half-life" this is not gonna work out though
> A fast SSD can easily hit a gigabyte per second sequential read speed, far faster than your typical network.
Nit: 1GB/s is ok, not even solid let alone fast.
A fast SSD pretty much saturates a PCIe 3.0 x4 link (which explains why they universally tend to cap out at 3.5GB/s). In fact there are now a few PCIe 4.0 SSDs (e.g. Corsair's MP600) which close in on 5GB/s.
I got the 1GB/s from my laptop I was writing the comment on. The internal NVME drive tops out at 1.5GB/s. I consider that fast but as you point out there's drives that make mine look slow.
Yes, but I'm heavily constrained because my access to the cluster I'm using is very low level and I get pre-empted quite a bit in my tasks. I'm probably over stretching between the amount of data I need to handle and my severe lack of skills (I'm a quant in training).
> Are you saying this can be optimised to fit inside a single 10 core server in terms of compute loads?
I'm currently employed to write software which does DNA analysis. DNA is known for being Big Data.
Some applications are very compute intensive and others are very data-relation intensive. The very compute intensive applications process about 1GB of data in about 30 minutes on a 32 core Xeon 6xxx with 32GB of RAM assigned to it. The data-relation intensive application processes about 350GB of data in 15 minutes on a single core of the same CPU but with 700GB of RAM assigned to it.
Both are heavily optimized but in different ways. So without knowing more about what you're doing with that data, it's hard to say.
So my entire dataset is ~24 x 250GB files. That 24 number can be larger if I can find an efficient way of processing each 250GB chunk. Each 250GB chunk is actually stock tick data so it has 500 stocks inside it. A heavily traded stock takes up ~10 GB of memory while a very thinly traded stock can top out at just 700 MB.
While I hope that each 250GB chunk has everything in order and I can separate it cleanly, I don't trust it. So I broke up all the files by stock with an embarrassingly parallelised bit of code I wrote and got 500 separate files. However, the problem is, I'd want to process these 500 files in parallel so each stock gets only so much memory (and hence my original constraint remains). And for each stock, I want the data to be queried on timestamps, so I need a way to quickly say "I have data at 9am on Thursday, I want data from 4pm Wednesday to 8am Thursday to create a model".
I figured the best way to do this efficiently was to have the code create a massive database for all the stocks and query it efficiently in SQL. But I'm stuck there due to a lack of tools.
> I have no doubt I could solve your problems. I honestly don't care to do so here though.
That's fair. I did not think it'd be such a difficult problem when I first set out to do it myself. But every single turn leading to a dead end kinda bummed me out. I'm going to finally resort to the database method of storing all the data in one query-able file and work off of that.
That's very interesting. Do you mind sharing some more details on what those computations are? Where can I learn more about these DNA analysis use cases?
I've found that Wikipedia has really good high-level information for a lot of the subject matter. For use cases, the business sells direct-to-consumer DNA tests. I've worked on the software for several of the analysis products.
The compute-heavy workload calculates edit distance [0] of short paired-end sequenced DNA [1] vs the human genome [2]. There is open source software to manipulate the FASTA/FASTQ [3] and SAM files [4] and run the calculations [5]. The aligned file is processed in a couple of minutes to genetic variation report [6] which is used for some of the analysis products that were purchased. One popular product will give you a haplogroup [7] which basically tells you where you are in a genetic tree.
The relationship estimator uses a different sequencing technology and basically consumes a CSV file from the sequencer's manufacturer. It uses a proprietary algorithm to calculate centimorgans [8]. That then gives relationship estimates between you and other people who've purchased the product.
Not who you asked, but, it's hard to say without knowing exactly what computation is being done, or how much of the time is spent on IO. If you organize that 250GB in ram the right way (cache coherency, right container types), and spend a lot of effort doing analysis of algorithm selection, you might be surprised how much you can get done on a single (large) machine with enough cores.
I'm doing my PhD in HPC (part of my work is in situ stuff). One of the biggest problems is actually IO. But honestly, 250GB isn't that big. I haven't used dask, but I know people that do. I just avoid python for anything I need performance for, C++ is just always going to be faster.
Biased: you might want to look into DOE libraries. For IO I suggest ADIOS2 [0]. There's python bindings too.
One of the biggest things you can do is use a different storage type like BP (ADIOS) or hdf5. These are readable but binary. But to really determine how to speed up your problem you have to know where the bottleneck is. Is it IO or compute? With 100 workers (threads or nodes?) you aren't highly parallelized. I mean that could be a single node if it's threads.
If there are floating point numbers in those csvs, I sped up a system like that 10x just by writing a custom (Java equivalent of) atof() that didn't do variable decimal separators and scientific notation. That's not even counting the improvements in I/O speed from the size reduction. Any system that works from CSV's is going to be slow. I don't know what sort of computations you're doing of course, but I did all the work on a laptop, couldn't be bothered to scale it out after the improvements I made in the first pass. How much of your code spends its time in I/O (including conversions) vs actually calculating?
Not sure which DB you are using, but you can load the csv file into the DB directly on a single thread using something like LOAD DATA INFILE.
If you have some good indexes and do some push-down work (give the database aggregation tasks to do instead of your python code), you should probably be more than fine.
For a 250GB file it should be OK; maybe add some partitioning too.
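A hedged sketch of that route, assuming a MySQL/MariaDB server and an existing `ticks` table whose columns match the CSV; the host, credentials, table, and column names are all placeholders:

```python
import pymysql

conn = pymysql.connect(host="localhost", user="quant", password="...",
                       database="market", local_infile=True)
with conn.cursor() as cur:
    # Bulk-load the CSV server-side; no per-row handling in Python.
    cur.execute("""
        LOAD DATA LOCAL INFILE '/data/ticks.csv'
        INTO TABLE ticks
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
    """)
    # Index the columns the timestamp-range queries will filter on.
    cur.execute("CREATE INDEX idx_symbol_ts ON ticks (symbol, ts)")
    # Push aggregation down to the database instead of doing it in pandas.
    cur.execute("""
        SELECT symbol, COUNT(*), AVG(price)
        FROM ticks
        WHERE ts BETWEEN %s AND %s
        GROUP BY symbol
    """, ("2020-01-08 16:00:00", "2020-01-09 08:00:00"))
    for row in cur.fetchall():
        print(row)
conn.commit()
```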
I'm open to using any DB that I can query over some engine with a Python implementation, so any SQL DB should be fine. However, I don't know how to load a CSV into an SQL database directly. Is the command you mentioned part of some SQL server package? It sounds like exactly what I need.
pandas can read from a CSV file and then write to SQL. Even if you don't go the SQL route, you'd probably gain significant benefits by working with HDF instead of CSV.
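A short sketch of that pandas route; the connection string, table, and column names are assumptions, and the chunked read keeps memory bounded:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for an already-running Postgres instance.
engine = create_engine("postgresql://quant:pw@localhost/market")

# Stream the CSV in bounded chunks so memory use never approaches the file size.
for chunk in pd.read_csv("ticks.csv", parse_dates=["ts"], chunksize=1_000_000):
    chunk.to_sql("ticks", engine, if_exists="append", index=False)
```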
Need to understand your domain better, but in many cases, the 250GB csv can be compressed down quite effectively using a columnar representation. And the columns can (potentially) be processed using simd/gpu based approaches to where a single server would outrun a cluster. Food for thought..
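As one possible illustration of the columnar idea, a sketch that converts the CSV into compressed Parquet pieces (needs pyarrow or fastparquet installed; the column names and dtypes are assumptions about the tick data) so that later passes can read back only the columns they need:

```python
import pandas as pd

# One-off conversion: chunked CSV read, columnar compressed output.
dtypes = {"symbol": "category", "price": "float64", "size": "int32"}
chunks = pd.read_csv("ticks.csv", dtype=dtypes, parse_dates=["ts"],
                     chunksize=5_000_000)
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"ticks_{i:04d}.parquet")

# Column pruning: pull just the two columns a computation needs.
prices = pd.read_parquet("ticks_0000.parquet", columns=["ts", "price"])
```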
I used to think that. But there are a couple of notable exceptions at either end of the latency spectrum.
If your latency requirements are slack, then you can get away with one machine, because you can reboot or reprovision it and carry on processing without missing your requirements.
If your latency requirements are tight, you don't have time to fail over anyway, so you might as well run one machine and make sure you can deal with failing to meet your requirements.
This is why I ran Redis on a single server with no failover strategy. Our team spent maybe 15 hours workshopping a handful of other things (redis cluster, sentinels + replicas) before realizing we could spend an hour to have everything sit in a degraded state working around it while Redis itself was fixed. Redis only failed once ever and it took all of an hour to fix it, all during non-peak hours.
But isn't the containerization trend leading us to a completely opposite direction ie. scale-out by default and do that not only because of performance but mainly because of how you want to manage your production environment?
This is precisely the point made by McSherry, Isard and Murray in their lovely paper, "Scalability! But at what COST?" (Usenix HotOS '15). They demonstrate how much performance headroom there is in modern CPU and memory, and show how simple cache-sensitive batch algorithms running on a single core can outperform hundreds of cores running distributed map-reduce style jobs.
This is true in a "water is wet" kind of way. The point is that a great many problems that can fit neatly into a single machine are being turned into I/O problems by being distributed onto clusters.
There's an incredible number of gigabyte and even terabyte scale problems that are consuming racks of blades when a little thinking and understanding of the problem being solved can be done pretty nicely on far fewer resources.
What's really happening is that many people think it's "easier" to simply rack more equipment into the cluster, and they end up shifting the complexity into cluster administration rather than programmer time.
> I’m a C++ veteran btw and understand the point but big data is about how to process petabytes of I/O not how to consume CPU.
I'm not sure this is cut-and-dry. Back when I was working on Spark workloads, there was some interesting research being done on where the bottlenecks were for jobs. I think it turned out for a lot of jobs, infinite disk / network io didn't give as much of an improvement as you'd expect.
Yes, you are right. That said, the COST paper's presumption is that not every "big data" problem is big (petabyte size), that most probably fit in a single system's disk & memory.
Not for everyone. In finance, a query might only use gigabytes or terabytes of data, but need to do a ton of simulation and calculation on top of that. Optimization of e.g. trading algorithms is entirely CPU-bound.
Even CPU is about "I/O" these days. Memory (RAM) is the new disk - and memory bandwidth is generally the performance bottleneck in heavy workloads, especially in multicore. This might be one reason why loosely 'C-like' languages like Rust are going back in style. High-level languages are terrible for memory bandwidth.
No doubt there are people who do it for cynical reasons. But at least some people do it sincerely thinking it’s the right choice. It’d be more interesting to talk about them, and how they came to make the wrong decision for what they thought were the right reasons.
I think it's a bit presumptuous to say that the decision is "wrong". That paper demonstrated that a single server can outperform a small server farm on a toy problem. Nobody, not even google, solves pagerank in production as a batch job. Real problems are often more complex.
For many workloads, companies have tech leads and CTOs absolutely over-architecting their stack. If you can run your entire system off of 4 load-balanced $200/mo servers, why am I seeing hadoop/kafka/kubernetes/etc running at $10k+/mo pre-money, i.e. paid with investor money? Sure, there are a lot of cases where this is fine, but I would say (from real life) that there are far more cases where this was a pretty poor choice. It usually proves to be even poorer when the tech lead leaves and no one seemingly has a clue how it all works together, despite the whole docker/kubernetes/CI stories peddled to management. That is usually when I get asked to take a look.
In my experience these problems are definitely not always toy problems, while some of them can easily be run on one server for the expected lifespan of the company, because the company will never get that many users/clients/data even though it is immensely profitable.
Not everyone (almost no one) is a FAANG, and I find it offensive to use company money (which can be investor money) to realise the wet dreams of the CTO when the case for that architecture, and its cost, makes zero sense for the business and the bottom line.
Obviously in life things are nuanced and differ case by case, but nah, real-world problems are generally not more complex. They are more complex in some (rare) cases, like the one you named. But most companies are not doing anything like that, yet they do have their CTOs gearing up for that incredibly unlikely future.
I've talked to many people like you mentioned. A large percentage repeat these points:
1) "We always wait for the database, so the performance of our code does not matter." This comes from places where, ironically, the whole database could fit into RAM. 10+ Gbps connectivity is almost a commodity for business now, so latency is not much of a bottleneck. Fast I/O to store data? Well, imagine an array of Optane drives - not very cheap, but really peanuts for a normal business. All this means that a purpose-built data server, which is the heart of many types of business, residing on a decent computer can be blindingly fast, and most businesses will never ever outgrow it. I've written quite a few such servers serving loads of businesses in NA, so hopefully my experience is not irrelevant.
2) "Scripting languages are so convenient to use and save so much time, and that's what matters, since salary is the main expense." Personally I use those mostly for management/deployment/etc. type scripts, and maybe to quickly test some small ideas. I clearly see their benefits there. Anything that resembles a product that would actually run a business? Sorry, but an experienced developer can implement those just as fast as in any scripting language, and it will save a ton on maintenance.
3) And finally, the "do not do premature optimization" mantra. In my opinion, switching from a scripting language to something like C++ has nothing to do with premature optimization. A developer can be very productive with compiled languages, as they also have megatons of libraries for any imaginable task.
> Sorry, but an experienced developer can implement those just as fast as in any scripting language, and it will save a ton on maintenance.
This is just a no true Scotsman argument. For 10 years I’ve watched python/ruby shops drastically outpace projects in C++/Java shops. What you’re failing to realize is how trivial most apps are and how fast fully functional back ends can be created with frameworks in those languages (django/rails).
Your bit about maintenance is also bullshit. I’ve spent entire days peeling apart complex C++ code bases to make a small change to some core abstraction. Ability to maintain code is entirely up to how well organized and documented it is. It has nothing to do with “scripting”.
"What you’re failing to realize is how trivial most apps are and how fast fully functional back ends can be created with frameworks in those languages (django/rails)."
Sorry, but I have the exact opposite experience. First of all, I would not call line-of-business applications incredibly simple; they're actually quite complex business-rule-wise. I saw floors filled with web developers constantly writing/rewriting an endless stream of scripts, often without any meaningful attempt to organize the code. In one example I was writing Python scripts (yes, guilty, it is good for this type of task) to process literally thousands of files of their source code to find and properly replace database access methods, of which a single app had no fewer than 5.
> Ability to maintain code is entirely up to how well organized and documented it is. It has nothing to do with “scripting”
> What you’re failing to realize is how trivial most apps are and how fast fully functional back ends can be created with frameworks in those languages
A vast majority of web dev is incredibly simple. People tend to easily convince themselves that their app is very special and requires a lot of complex solutions to make it work. There’s a lot of incentives that lead people to that conclusion, but most of the time following the golden path laid out for you by an existing framework is going to produce a better product with much less effort.
I run a saas platform and from the outset, I knew that we wouldn't need to do anything special or groundbreaking. We are just sort of line of business. But what I found is that the majority of the complexity started to show up when we are scaling and needed to be able to get results to the user very quickly. That's when you start to deal with queues and pubsub and architecting things to run in parallel. We have a process that only takes about 10 seconds to run completely. Which is acceptable lag time for us. But now, 20 people are simultaneously making that request and the person at the end of the line has to wait 200 seconds, which is not acceptable. This only happens occasionally, so adding servers would be a big waste. That's where the more complicated cloud setups start to help.
It's more like: "Here at BigCo, we do X because it scales. We have learned it from various incidents that we won't tell you about, but trust us, we have to do it this way."
I've worked at BigCo. It's resume padding with the fear of looking like an idiot for not knowing about the new tech.
We were going to go all in on Snowflake with everyone on the team being for it. I sat down, read the original whitepaper, wrote a simulation of what the costs would look like with the current read/write statistics and tested a small batch of data on it to double check.
It turned out we would have paid between 500x and 10,000x what we were currently paying for using Postgres. I moved on three months ago, and last I heard they were trying to use Snowflake again.
I'm starting to wonder if there is market to do "technology laundering", use things like PostgreSQL, SQLite, standard Unix tools, put it under some cloud marketing and charge a x10 premium.
Or perhaps not only there is market, but that's more or less what everyone is already doing.
If you can set it up so it solves my problems, yes.
I could manage my own server and fine-tune everything, or I can throw it on Snowflake. Snowflake means I've spent almost no time managing anything, and it was costing less than an AWS box running Postgres but absolutely blew it out of the water performance-wise. It depends on your workload, but it's been perfect for one of my use cases. If they were just using Postgres under the hood and I got the same experience - fine.
Big things are time sharing resources and management/updates/etc. Can you charge me 10x the underlying cost but let me pay for two minutes on a wildly powerful machine? Great, that's a net win for me.
I know nothing about Snowflake, so I can't really comment on this particular case. However, the generic statement I hear often goes something like this: it is so much trouble to manage your own infrastructure, and on the cloud everything is done for you. Well, I saw with my own eyes that having one's infrastructure deployed on Azure keeps people quite busy anyway.
After a year of AWS I can tell you it's as much a PITA or more to manage all the components as it would have been to manage a few Spring Boot and MySQL Droplets.
Not really OLTP, analytical workloads but can't go into much detail I'm afraid. Infrequent, unpredictable and benefitting from rapid scaling (0 to lots of power & ram for short periods) is where cloud type things (can) really shine.
This was part of the excitement/promise of Go: to eliminate the need for small-scale MapReduce jobs, with library code that launched a year or two before the COST paper was published.
As someone who has implemented a complex system in C++ in this decade, I’d say he’s not wrong, but you need to carefully weigh the pros and cons.
In our case, latency and real-time demands mattered a lot (a NASDAQ feed parser), to the point that the (potential) slowdown of a garbage collector kicking in was enough to rule out Java and .NET. It runs entirely in memory and on 64+ cores.
We implemented our own reference counting system to keep some of our sanity, and to at least try to avoid the worst memory leaks.
This was an edge case, and for almost everything else you’re probably better off implementing it in something that handles memory for you. If performance is an issue, at least try it in Go or Rust with a quick PoC before jumping on the C++ wagon.
Memory management in modern C++ is considerably easier than it used to be. Bespoke memory managers aren't really needed, you can do almost anything you need to without ever using new or delete.
If you are writing C++, in any field, just use std::shared_ptr and std::unique_ptr from the standard library, along with std::make_shared and std::make_unique.
C++ shared_ptr unfortunately is artificially slowed down by multithreaded synchronizations. The count in the control block is always updated atomically. But I frequently need to use them not because my data is shared by multiple threads, but because the ownership situation isn't static. Consider for example implementing a single-threaded persistent tree.
This means people frequently need to reinvent their own reference counting mechanisms.
To be fair, std::shared_ptr is just a high-level component that's a part of the STL. Although it's defined in the C++ standard, it's just a generic shared pointer implementation designed to be as robust and bullet-proof as possible.
As with everything in the C++ STL, if you care about bleeding-edge performance then you should be prepared to use more performant (and less generic) components, whether it's data structure implementations or shared pointers.
This is the quickest way to kill your performance in C++. I guarantee it wasn’t what Carmack was talking about.
I used shared_ptr extensively in a game engine. Whoops: suddenly 20% of the frame time was gone, never to be recovered. Once that performance is gone it’s almost impossible to get it back, short of rewriting every system.
Only use shared_ptr when you actually want shared ownership. If you stick to unique_ptr for owning references and then "borrow" that raw pointer in functions, then you get speed without sacrificing too much safety and still no new/delete.
Additionally, shared ownership should be rare. Most objects should have a single owner that is responsible for them. This is not only better for performance, but also makes the system easier to understand.
std::shared_ptr uses CAS atomics, which in heavy multithreaded code (multiple threads operating on the same pointers), can have surprising overhead in some situations.
On the other hand, not using atomics in heavy multithreaded code is just a recipe for disaster. The problem with shared_ptr is the fact that it uses atomics even in single-threaded code, which is obviously an overkill.
> The problem with shared_ptr is the fact that it uses atomics even in single-threaded code, which is obviously an overkill.
On the other hand imagine the security issues if shared_ptr was not thread-safe - you could just not reliably destroy a shared_ptr in any thread.
If you really know that a graph of shared_ptrs will never move off a single thread, you can use boost::local_shared_ptr explicitly.
One of the advantages of rust is its ability to segregate safe-to-share (between concurrent contexts) and not so, at compile time.
While it can’t (yet?) be generic over them, it lets you use the non-atomic Rc in thread-bound structures and know this is never going to be shared between threads, whereas the more expensive Arc can have handles be moved from one thread to an other.
If you're parsing market data then you shouldn't really be allocating at all once the initial setup is complete. So there shouldn't be a need for ref counts.
This is because the allocator itself can have an unbounded runtime that takes milliseconds, causing you to drop.
In the past I've replaced malloc with an implementation that asserts if called on certain threads after init time.
Most likely it is simply invoking a syscall (sbrk or mmap), which invokes the scheduler and may yield the time slice. Feed parsing code may not issue any other syscalls (entirely userspace), turning the allocation into a bit of russian roulette.
Except compaction can also take many milliseconds and come from different threads.
Writing a trading system in Java is harder than C++ imo because where before you had an allocation problem, now you have a multithreaded randomly stalling allocation problem.
Virtu did it but everything I’ve heard about it nullifies the benefits of using java in the first place.
Newer Java GCs are very low latency (microsecond). You trade performance and memory for that low latency though. AFAIK, they are still compacting.
Still though, probably makes sense to do it in a lower level language. It's just far easier in C++ to decide that "Hey, you know what, I just want a big memory block that I control".
I've even heard of game devs doing things like having per frame allocators. They get super fast allocation because they just pointer bump and at the end they simply reset the pointer back to location 0. I'm sure trading systems could do something similar.
The point is that a one millisecond pause is unacceptable. Low latency Java GCs have average latencies of one millisecond, 99th percentile latencies of 10 milliseconds, and 99.9th percentile latencies are neither measured nor optimized for.
I don't consider it realistic to think that garbage collected languages might ever be usable in the context of game engines or HFT.
Game engines are already written in GCed languages. Java in particular.
You may be right that a 10ms pause is unacceptable for HFT. However, for a FPS, 10ms is more than acceptable. It translates to 1 or 2 missed frames in the worst case.
A bigger issue with using Java in particular for games is its lack of value types. Writing high-performance code in Java is just that much harder because the language gets in the way.
The only serious engine I know of that's written in Java is Minecraft and its performance problems are notorious.
Microsoft XNA and Unity's .net support were also pretty popular but the popular games written in those languages (like Bastion) didn't have many heavyweight assets or allocator pressure.
Having written lot of Java, Scala and C++ (and recently some Rust), I must say it is much easier to avoid heap allocations in C++ and Rust than in GCed languages, thanks to explicit allocation on the stack and pass by value + move semantics.
A big push in .net core 2.x and 3.x was what we call the Span-ification of the base class library and the runtime. This means there are many new APIs for dealing with slices of memory in a non-allocating manner, and this combined with memory pooling has contributed greatly to an overall performance boost to the runtime by reducing copying and GC time. These same APIs are available to the developer so I'd imagine that it would be simple to build a non allocating network buffer reader. I have built a non-allocating video renderer before using the new APIs, for example.
There exist libraries for native memory pooling in Java as well, and we're using them. I'm not saying low allocation code can't be done in C# or Java. But these languages don't give some nice tools that are present in C++ and Rust - in particular RAII and automatic reference counting.
> I'm not saying low allocation code can't be done in C# or Java.
With modern tools and APIs it should even be possible to write alloc-free work loops in C# (though you're probably writing pretty alien C# at this point). AFAIK it'll remain impossible in Java until the "value types" effort bears fruits.
A few months back a study of sorts made the rounds implementing a network driver in multiple languages (Ixy). C# did extremely well in it (better throughput than Go and nearly competitive with C and Rust at higher batch sizes, though it was way behind on latency), while Java was pretty much in the dumps.
In one of the reaction threads (don't remember if it was on HN or Reddit) one of the people involved explained the discrepancy between Java and C# by not being able to go under ~20 bytes of allocation per packet forwarded in Java.
There were other odd / interesting results from the effort e.g. Rust was slightly slower than C, in investigating that they found out Rust executed way more instructions (especially significantly more stores) but had significantly higher IPC and much higher cache hit rates.
A trading system in C# or Java is not, practically by definition, state-of-the-art. To suggest otherwise just demonstrates not knowing the State of the Art.
But aren't the edge cases the things that justify your pay? The standard cases of today are the introductory examples of tomorrow and will be automated or abstracted away by the end of next week. In JavaScript.
I'm 15 years into writing high-performance Internet servers in C++, and I can confirm that higher-level languages provide an illusion of capability; but once you're talking high performance with high compute requirements and scaling your service, the cost efficiency of C++ is exponentially better than any other language's. The higher-level language ecosystems are bloated beyond repair.

I was able to use one 32-core physical server running a C++ HTTP server I wrote, providing a rich media web service, to replace an AWS server stack that cost my client 120K per month. The client purchased one $8K 32-core server and co-located it behind a firewall at a cost of $125 per month. And the C++ server ran at 30% utilization, plenty of room for user growth. Their AWS stack of a dozen C#, Python, PHP and Node apps was peaking its capacity too.

Of course, my solution caused existential questioning by the non-geek CEO and the CTO, but they were in crisis and needed to radically revise how they provided their service or close.
The thing that worries me about stories like this is that there is frequently (as is the case here) no mention of any sort of HA or backups. No details on what disaster recovery looks like. Those are business-critical considerations that cost money, and they just disappear from the discussion when people say "hey, I saved all this money dropping everything down to a single server!"
Well, in the case described above, my single-server solutions include an automated backup sub-system, and my servers expect multiple instances of themselves to be running on the client network; these instances synchronize with one another via additional endpoints specific to that purpose. The whole issue of HA and backups is critical and one of the areas where my approach shines.
You're still not actually answering the question of how you are HA and backing things up with only a single physical server and nothing else.
If you're backing things up to the same server, that's not enough.
If the HA instances are running in the same server, that's not enough.
If there are other things besides that one physical server and its power/network, you didn't include them in the cost, so the comparison is disingenuous there.
I do say the deployed system ends up being multiple instances of my single server, which synchronize with one another. Those are separate physical devices each running my one server. Additionally, when a backup runs the data is stored locally as well as on a physical storage device separate from the hardware it is running. Typically clients already have a firewall/router which is used to distribute requests to the various instances. My deployed systems are not one server, they become a server mesh.
> Those are separate physical devices each running my one server. Additionally, when a backup runs the data is stored locally as well as on a physical storage device separate from the hardware it is running. Typically clients already have a firewall/router which is used to distribute requests to the various instances.
Awesome! Really glad to hear this is the case. But those are all added costs beyond the single physical server you gave the price of.
Server cost * number of physical servers you have deployed
Cost of your off-server storage
Cost of your network appliance doing load balancing
It's still probably way less than the AWS bill, but it's not really fair to compare the total price of infrastructure in one environment vs. just a portion of the other.
I am in the same boat (writing native servers). I will also "disappear" if you start asking me about HA/backups/etc. Particular solutions are very much case-specific and can depend on business rules just as much as on pure tech factors. Properly answering your question requires way too much writing and is hardly the subject of a single post. I have HA solutions for the products I've built, but this post is the extent I am willing to talk about the subject ;)
I'm not asking about the specifics, and I don't really care about them.
But the fact of the matter is quite simply that any single-physical-server solution will never be satisfactory for backup or HA purposes.
You can't store your backups in the same place as your data and call it good - what do you do when you have multiple disks fail and your RAID can't be rebuilt? This happens. What do you do when operator error accidentally destroys the array? This happens. What do you do when there's a datacenter fire and the server burns up? This happens. None of these things should be a business ending event, but if you only have a single server handling quintuple duty, that's what it has a real chance of being.
If you need HA, a single server isn't good enough even if you've got multiple VMs running the service. What do you do when the utility power is out and the generator fails to kick on properly or they run out of fuel? Both of those happen pretty frequently. What do you do when there's a network outage at the DC? This happens. When someone fucks up BGP somewhere and now the prefix you used is being routed to god knows where? This happens. When you have any sort of physical server failure that brings your single box down? This happens. Any of those situations will take you offline and render your HA meaningless.
I'm not saying that any of these things are impossible to do when colocating your hardware - but they're not free, and they're not mentioned even at a high level in this story. And since we're not talking about running benchmarks on price/performance and instead talking about a service that a business needs to keep available to their customers to make money, these are important aspects to talk about. Everywhere I've worked, keeping our services available and being able to recover from hardware failure are far more important priorities than being able to optimize performance. And doing those things properly takes more than one physical server.
I can't speak for FpUser, but you misunderstand the idea of creating a single server: one does not just run one of them, they know about and expect multiple copies of themselves to be running at different IP addresses, and they synchronize with one another, as well as maintain individual backups that additional background processes validate between different instances.
Each and every one of your disaster scenarios is handled by the architecture. Each and every one of your disaster scenarios has happened, and we've lived through them, as well as reviewed and optimized, after the fact, how we handled the events. As you are, we're professionals.
Currently using Restbed (https://github.com/Corvusoft/restbed) as the server core, wxWidgets as a server-side GUI, with Boost, Curl, SQLite and the Standard Library. It's not that complex, beyond using lambdas in a few places. It has extremely high performance and can run on an Intel Compute Stick, but I tend to use an Intel NUC at minimum, with clients typically using whatever hardware they have, gaining redundancy and the ability to pare down. Memory management in a C++ application is just another resource one manages, with whatever level of algorithm support you feel comfortable with. There are ref-counting systems and complete garbage collectors available that one can integrate into their business logic, unlike in a high-level language that "transparently" manages memory outside application control.
Have you ever thought about how much processing a typical 3D video game performs every frame? What if that caliber of optimized algorithm logic were handling a rich media, non-3D game server hosted business application? It would have pretty amazing performance and scale very economically. That's what I do. Before doing this, I wrote 3D video games and their production environments.
Horizontal scalability carries a lot of overhead. Probably a factor of 10, easily. But the clue is in the name: eventually you'll get to a point where you have to scale.
Back in 2010 I worked for a company whose system, in Java, ran on a single web server (with one identical machine for failover). We laughed at our nearest rivals, who were using Ruby, and apparently needed 60(!) machines to run their system, which had about 5x the average request latency of ours.
Then traffic doubled, and suddenly we were having to buy six-figure servers and carefully performance-optimize our code, and our rivals with the 120 Ruby servers didn't look so funny any more. And then traffic doubled again.
> But the clue is in the name: eventually you'll get to a point where you have to scale.
Isn't that the conceit underlying this whole argument, though? Many systems won't ever get to the point where you have to scale in that way, if you build them efficiently in the first place.
Perhaps more importantly, for many applications, you'll be able to see the limit coming some way ahead, and if you've reached a size where you do need a fundamental restructuring in order to start scaling horizontally, that's going to be a nice problem to have and you'll also have the resources to do it.
I know several online systems that are handling significant traffic volumes perfectly well on a simple, single-server basis. They don't get bogged down in infrastructure and tooling issues, ever. They don't get confused by complicated cloud hosting issues, ever. They are free to spend almost their entire development budget on actually developing useful functionality, which is like a breath of fresh air in today's dev culture.
Obviously in the more serious cases they probably also have some redundancy for backup/failover purposes, but even that is simple and, if necessary, can probably be handled manually when you only have a handful of servers to manage. Here I do slightly disagree with one of Carmack's later tweets: I would argue that going from 1000 servers to 100 is just accounting, but going from 100 down to 10 or fewer is more of a qualitative change (albeit not exactly the same qualitative change as going from 10 down to 1).
> You must have had significant in-memory shared state to encounter that problem. Right?
"Significant" is in the eye of the beholder. The core of the system was easy to make shardable. But you'd be surprised how many implicit assumptions creep in, how easy it is for ancillary parts to end up sharing state when it's easy. Also note that just because your state's in a database doesn't mean having two instances of the thing that accesses it will work, you can easily end up with an access pattern that assumes only one reader (for example) even though on paper there's no in memory state.
> Had you adopted a less stateful model, you'd have looked rather pretty with two Java servers.
Maybe. At the point where we're running 4 or 8 servers we'd have been facing much the same ops problems that they were. Java bought us an extra year or two of not having to deal with that, but also a significant amount of migration work when it just became impossible to stick to the single process model. At the end of the day we still kept the 5x latency advantage, which is definitely not nothing. But there were also definitely features that they brought to market quicker, and I'm pretty sure Ruby played a part in that.
Tradeoffs, tradeoffs everywhere. I left before the final outcome of that fight (for all I know it's still ongoing), but I don't think either company was being dumb.
> At the point where we're running 4 or 8 servers we'd have been facing much the same ops problems that they were.
Yes and no.
You'd need some ops work, but you'd need to worry a lot less about managing your infrastructure provisioning to keep costs low, e.g. reserved instances, dynamic scaling, etc., and about putting out fires when you inevitably exceed your tight perf margins.
You could overprovision 24/7 by 50% and write it off. Your competitors couldn't.
> Web servers are usually trivially horizontally scalable.
Maybe he really meant to say 'application server'. I.e that there was server side code with non trivial compute/memory requirements for running business logic.
Of course it's possible to write a horizontally scaled application in Java or C++. But once you have to deal with horizontal scaling anyway, language performance is much less of an advantage: as Carmack says, the difference between 100 servers and 10 is just accounting.
And the difference between 1000 servers and 100 is not just accounting. In essence, 1 full rack is easy to reason about while N>1 is not, for myriad non-accounting reasons.
It should be noted that report is talking about the client devices too - desktops, laptops, and even TVs. The datacenter part only accounted for 15% of that.
Sometimes in a specialized team, the difference between 100 servers and 10 is profitability. Don't want to get laid off because that cloud bill is $100k a month.
In my experience it's the individual contributors who are overly obsessed with being elegant and efficient in their use of machine time. Those who are conscious of the bigger picture tend to have a more accurate sense of the relative costs of machine time versus engineering effort.
Multiplying your datacenter bill by 10 because you couldn't be arsed to spend two days thinking about system architecture isn't an "accurate sense of the costs" in any universe, except the one where you're spending venture capital bucks and waiting to get quickly bought by a Google with more free money than they can count.
Sounds like they had their datastore as part of the application. So the only way to scale it would be to write a distributed data store... or rewrite the app :p
Yes, I’m always shocked by just how much performance overhead most languages have compared to C and similar lower level languages. It is a price worth paying for better language ergonomics, but I do wonder whether Rust might be able to give us the best of both worlds here.
I semi-seriously think the entire modern shape of the cloud is a result of Ruby being really slow.
Back when people were writing their backend business apps in C++, COBOL, Java, etc, if there was ever a performance problem, you could usually just get a slightly bigger machine and grow your thread pools a bit. But once the web took off and Ruby exploded onto it, you couldn't do that, because it's an order of magnitude slower, and doesn't really do multithreading. But, as long as you follow twelve-factor discipline, it scales horizontally like a champ. So, we took to horizontal scaling over multiple VMs (and caching things in Redis instead of local memory or Hazelcast or whatever), and that's been the unquestioned way to do scaling ever since.
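A rough sketch of that pattern, assuming a Redis server on localhost and the redis-py client (render_profile is a made-up placeholder): because the counter and the cache live outside the process, any number of identical instances can serve any request.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def render_profile(user_id):
        return f"<profile page for user {user_id}>"   # stand-in for real page assembly

    def handle_request(user_id):
        views = r.incr(f"views:{user_id}")        # shared counter, not a local variable
        cached = r.get(f"profile:{user_id}")      # shared cache, not a per-process dict
        if cached is None:
            cached = render_profile(user_id)
            r.setex(f"profile:{user_id}", 300, cached)   # expire after 5 minutes
        return cached, views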
The push for scaling out started with Ruby's and Python's lack of performance. The reason being pushed at the time was that "developer time is more expensive than hardware." Well, that didn't count the amortization of developer time over the lifetime of the product once the product was developed.
It's mostly a fallacy that a product in demand ever becomes "developed". Maybe a game that gains cult status, and therefore a long tail of life, is the exception. But popular web services are in constant churn, and in that space it's valid to trade hardware for programmer productivity.
In my experience this is only true iff the system never gets any more user facing features.
Every new non-insignificant feature requires a new REST-API route, or database table or modification of the GraphQL schema. And depending on how you designed the backend, even small redesigns of the frontend might require changes on the backend. Consider a simple app showing car rentals, where you initially have something like /car/[id], then the frontend guys realise that we need to show cars rented by each customer and it's necessary to have /customer/[id]/rentals.
(Of course it's possible to design a system without any schemas or normalisations, which would make your statement truer, but that's rarely attractive for other reasons.)
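A hypothetical Flask-style sketch of that change (route names and helper functions are invented for illustration): the frontend redesign forces a new endpoint on the backend even though nothing conceptually new was built.

    from flask import Flask, jsonify

    app = Flask(__name__)

    def find_car(car_id):
        return {"id": car_id, "model": "placeholder"}          # stand-in for a DB query

    def find_rentals_by_customer(customer_id):
        return [{"car_id": 1, "customer_id": customer_id}]     # stand-in for a DB query

    @app.route("/car/<int:car_id>")
    def get_car(car_id):
        return jsonify(find_car(car_id))

    # The endpoint the frontend redesign suddenly requires:
    @app.route("/customer/<int:customer_id>/rentals")
    def get_customer_rentals(customer_id):
        return jsonify(find_rentals_by_customer(customer_id))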
But all those are just extensions using the established frameworks and development practices for that backend application. Once those are developed, it's pretty easy to add new features. I would argue that using scripting languages like Ruby and Python brings no significant benefit in development time. In fact they make it worse when working on an existing app, with the extra unit tests, difficulty in refactoring, and extra performance work.
> Well, that didn't count the amortization of developer time over the lifetime of the product once the product was developed.
I'll bite. I think it actually counts not only that, but also the probability that the product will actually get to be developed, and not need to significantly pivot, therefore throwing all developer time spent on performance in the garbage.
If hardware speeds and capacities had continued to increase as rapidly as they used to, there might have been some truth in that. Buying a computer with a faster CPU or a bigger disk or more RAM was a relatively cheap solution to a lot of performance problems for a long time, in no small part because it typically required zero changes to the software itself. However, once you're talking about qualitatively different hardware architecture (scaling out instead of up) and therefore also qualitatively different software architecture, it's far from obvious that the premise still holds or that we should even expect it to.
I don't necessarily think you're right here, but I do know the number of horror stories that have come out of Heroku over the years certainly validates the opinion.
I remember reading about their routing debacle and realizing just how much work went into trying to get ruby to scale.
Your points are all valid, but I don't think it's the "unquestioned" way. It simply happens to be a great way to scale, and also to isolate complexity etc. You can scale to monster loads this way, in a way that in-process caches can't.
The idea that GCed languages in general have Python-like performance is a dangerous myth. Languages that are managed but not interpreted (e.g. Java, OCaml, Haskell, C#, Swift) have performance characteristics that are much closer to C than to Python.
While I agree they can sometimes compete in numerically heavy tasks, the place where they fall behind is memory management. Systems like databases need very careful memory management and GC is not always your friend there. I'm still hoping some day we'll be allowed both GC and Rust-like manual memory management in a single language, although I'm not sure it is at all possible.
> the place where they fall behind is memory management
It's sometimes possible to design C# code in a way that doesn't stress GC too much, so the majority of bandwidth bypasses the GC.
For instance, here's my old .NET project which plays streaming media for hours without interrupts, with a hard limit of 15MB RAM for the whole process: https://github.com/Const-me/SkyFM
Doing so became easier in modern .NET, with these value tuples and spans.
> still hoping some day we'll be allowed both GC and Rust-like manual memory management in a single language
Microsoft tried: Managed C++, then C++/CLI. The CLI is still supported if you want to try, but IMO they both were way too complex: 2 types of pointers, two runtimes with weird interaction between them, and the worst of both sides on safety and ease of use.
AFAIK most people only used these languages for a thin layer of glue to integrate native C++ with C#. And even for that limited use case, COM interop or C interop often worked better. Even MS switched to COM interop in the next iteration, C++/CX.
Actually, Python is compiled to its bytecode and then interpreted by the Python VM, which is also how Java works. Python is slow because of the lack of funds and focus; just take a look at how fast JavaScript (V8) is nowadays.
No, Python bytecode is still interpreted at runtime. Java, JS and .NET are first compiled into bytecode, but then they are also JIT-compiled into machine code, which Python is not.
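You can see the interpreted bytecode directly with the standard-library dis module (exact opcode names vary by CPython version):

    import dis

    def add(a, b):
        return a + b

    dis.dis(add)
    # Prints something like LOAD_FAST a / LOAD_FAST b / BINARY_ADD / RETURN_VALUE;
    # CPython's interpreter loop executes these one at a time, with no JIT step.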
You can JIT Python also using https://www.pypy.org/ but it's not the default and it's not 100% official and compatible.
It's also nowhere near as fast as other language JITs.
I'm talking orders of magnitude slower than JVM, .NET and LuaJIT. End result being a - relatively - dead project, since the incentives for people to use it over stock Python and pay extra compatibility costs are not there.
I think you entirely ignored the performance cost of dynamism (not having types known, not having value types, dynamic binding of methods etc..) that is handy 1% of the time but imposes type safety & performance cost other 99% of the time..
> It is a price worth paying for better language ergonomics
Back around 2001 or so, I was hired at an online travel agency to assist in porting their entire system from C++ to Java. Management at the time was frustrated at how long it took to add new features in the C++ codebase - it could take months to get something working in some cases, and they blamed the programming language. Once we got everything working in Java, we found that we could, in fact, turn around feature requests much faster than they could in C++. The trade-off was performance problems: they (we) found that in the old C++ codebase, if you fucked something up, the whole thing crashed, and you had to fix it before you could get it to run. In Java, programmers could paper over their fuck-ups pretty easily so that they wouldn’t be noticed until they had created a snowball effect that caused everything to slow down. Since management was pushing for more features faster, they were incentivizing developers to do as little testing as possible and kick the can down the road. For the most part, management was OK with this: they just bought more, and more, and more servers to make up for the performance problems we were having from the quick-turnaround features they wanted.
He did say that Java/C# are also up there, and Go is in that family, so it probably remains the best balance for lots of cases. I do think there's also territory to be explored writing hot paths in Rust and interoping from Python/JS.
not how i read his tweet. he was lamenting that python is too slow for some server side development use cases. and he gave cpp as an alternative that would be simpler and faster. he even followed up citing java and csharp.
totally agree. if all your backend server is mostly complex serialization & de-serialization, and pushing bytes to other sub-systems, i think many other languages have advantages over python.
IMO, the linked article is much more insightful than the flippant comment here might suggest. It's a solid argument, backed by real world data, about how easy it is to make bad assumptions equating better scalability with better performance.
Carmack is moving to AI and inevitably has to deal with a lot of Python, which still bottlenecks things here and there despite all the effort to move computation to C extensions. I really hope he detours a bit and creates very, very good non-Python tooling for ML.
Python has some fundamental language semantics that make it really hard, if not impossible, to create an implementation that can match Java. Pypy is probably the best you can do to optimize python, and it shines in tight numerical loops, but it gets less effective as code gets more complex.
I've been programming C++ and assembly for 23 years. Few years ago I became a huge fan of Python. In my opinion Python is amazingly well suited for rapid first revision and can then be swapped out for C++ / asm.
This is fine as long as you can convince management to spend the money to rewrite your software. That's usually a hard sell though. In my experience this plan usually ends up with a python monstrosity that everyone hates but is forced to deal with forever.
You just have to write a tiny part that uses a lot of CPU in C++/asm or anything else.
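As a toy illustration of that split, using numpy as the "compiled tiny part" (the same idea applies to a hand-written C extension or asm kernel):

    import numpy as np

    def score_pure_python(values):
        return sum(v * v for v in values)        # hot loop runs in the interpreter

    def score_native(values):
        arr = np.asarray(values, dtype=np.float64)
        return float(np.dot(arr, arr))           # same loop pushed into native code

    # The request parsing, logging, and other glue around this stays in Python;
    # only the inner loop needed to move.
    print(score_pure_python(range(1000)), score_native(range(1000)))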
Much of the code's performance isn't really reflected in scalability, since usually only a tiny part of the code runs most of the time; the other parts are just glue, management stuff, or rarely used (not used at scale) features.
It depends. You aren't going to make a very fast modern codec encoder or decoder using Python. The hotspot ends up being the vast majority of the process. That management/glue layer becomes very thin, amounting only to feeding in the bitstream and reading back the raw video frames.
My good friend built a whole career doing exactly the same: "Replace a cluster of 10 Elasticsearch servers with 1 running a custom-built C app and an in-memory database".
Of course, it won't work out to replace 1000 Elasticsearch servers - that's where the advantage of a true "big data" tool will show - but none of the clients really have data "that big".
That's the reason why you need to go multi-process if you want to reach a similar level of concurrency in Python as multi-thread in C++. And that surely adds a lot of complexity.
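A minimal sketch of that multi-process route for CPU-bound work, using the standard library's multiprocessing.Pool:

    from multiprocessing import Pool

    def chunk_sum(chunk):
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(8_000_000))
        n = 4
        step = len(data) // n
        chunks = [data[i * step:(i + 1) * step] for i in range(n)]
        with Pool(processes=n) as pool:
            total = sum(pool.map(chunk_sum, chunks))
        print(total)
        # Each worker has its own interpreter and its own GIL, so the chunks really
        # run in parallel, at the cost of pickling the data across process boundaries.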
As a very practical example of this, TensorFlow has a dedicated page with advice on how to make the Python part that reads the files from disk less slow. Think about that: The bottleneck for training a highly advanced AI with millions of parameters is in the 20 lines of Python code to read in a binary file...
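The usual fix is generic and small: overlap reading the next batch with computing on the current one, via a background thread and a bounded queue. This is the idea behind tf.data's prefetch; the names below are made up for the sketch.

    import queue
    import threading

    def prefetching_batches(read_batch, num_batches, depth=2):
        q = queue.Queue(maxsize=depth)

        def producer():
            for i in range(num_batches):
                q.put(read_batch(i))     # blocking disk/network read happens here,
            q.put(None)                  # overlapped with the consumer's compute

        threading.Thread(target=producer, daemon=True).start()
        while True:
            batch = q.get()
            if batch is None:
                return
            yield batch

    # Usage sketch (load_file and train_step are hypothetical):
    # for batch in prefetching_batches(lambda i: load_file(f"shard-{i}.bin"), 100):
    #     train_step(batch)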
Many other people "know" about the GIL, to the extent of believing there's no point using threads in python "because of the GIL".
I had a funny such experience lately in a job interview. I told the interviewer his misconception could be falsified with ~10 LOC summing a list with 2 threads.
How did you ensure those threads would actually execute concurrently?
If any operations were GIL-bound (which is extremely common, even if not intentional, by reliance on bytecode instructions dealing with CPython API under the hood, like attribute lookups or iteration special methods), then execution is probably interleaved serially and constrained by the interpreter’s GIL allocation, and slower than just summing serially.
I’ve seen a lot of people who are cocksure they know some contrarian “actually you can use threads” trivia about Python and just naively use the threading module or naively use the newish ThreadPoolExecutor stuff not realizing that no, in fact, it’s not somehow magically always GIL-avoiding to do so.
> How did you ensure those threads would actually execute concurrently?
By having multiple concurrent IO requests from multiple threads? Can probably log detailed timestamps to see if IO requests were happening in parallel.
You're right of course. I assume the OP used a trick by going via IO somehow. Though granted, it might require having a forking server to do the actual summing or using another language, which would be more than 10 lines of code
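Something along those lines, as a sketch: time.sleep stands in for a blocking IO call (like a real socket read, it releases the GIL while waiting), and the logged timestamps show the waits overlapping instead of adding up.

    import threading
    import time

    def fake_io(i):
        start = time.time()
        time.sleep(1)                              # stand-in for a network call
        print(f"request {i}: {start:.2f} -> {time.time():.2f}")

    t0 = time.time()
    threads = [threading.Thread(target=fake_io, args=(i,)) for i in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"5 one-second 'requests' took {time.time() - t0:.2f}s")   # ~1s, not ~5s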
Ok, I see some comments (rightfully) asking for less talk and more code.
    # main.py
    import random
    from concurrent.futures import ThreadPoolExecutor as Pool

    items = [random.random() for _ in range(10 ** 7)]

    def run(items, n):
        step = len(items) // n
        with Pool(max_workers=n) as ex:
            res = [ex.submit(sum, items[i*step : (i+1)*step]) for i in range(n)]
        return sum(r.result() for r in res)

    if __name__ == '__main__':
        import timeit
        import sys
        n = sys.argv[1] if len(sys.argv) > 1 else 1
        time = timeit.timeit('run(items, %s)' % n, 'from __main__ import run, items', number=10)
        print("%s\t%.3f" % (n, time / 10))
$ for x in `seq 1 16` ; do python3 -m main $x ; done
1 0.172
2 0.170
3 0.166
4 0.155
5 0.149
6 0.142
7 0.144
8 0.140
9 0.136
10 0.135
11 0.135
12 0.137
13 0.135
14 0.136
15 0.136
16 0.136
The point of the code is not to speedup the execution of summing a list of random number, but rather to speedup the acknowledgement of N random python developers that they have some misconceptions about the GIL.
I think it does that pretty well but, well, that's just like my opinion.
The major reason people use threads is to speed up compute intensive tasks.
Ideally, pure computation like the one in your example should get a speed-up linear in the number of threads: 2 threads -> 2x faster, 3 threads -> 3x faster, up to the number of cores.
If that's not happening then you have excessive locking.
In Python excessive locking is caused by the GIL - the Global Interpreter Lock. As the name implies, there's a single, global lock, and execution in one thread effectively blocks all other threads because it holds the GIL captive.
What this particular benchmark is showing is that the GIL is as bad as everyone says: instead of getting a 16x speedup, you get speedups that are almost within the margin of error for such a coarse measure.
I feel like you're beating a strawman. Using threads to speed up IO in Python is common, and for computation heavy work the fear of the GIL is 100% justified, as your example shows.
I'm a casual Python programmer (just bit of scripting) and I had no particular misconception about the GIL; I don't care because I write single-threaded cookie cutter scripts.
Your example I think is demonstrating the opposite of what you want to show. Those figures are atrocious.
What were the misconceptions of your interviewer you were trying to prove wrong with your POC of multiple threads summing a list?
The way I read your post, it seemed like your interviewer told you that threads in Python are not effective for parallel computation because of the GIL, and your example proves exactly that. The performance of your threads is absolutely horrible; if you were to do that in C++/Java/Go you would likely see a speedup on the order of min(16, cores). Your example proves that your threads are exactly serialized during the computation, which I assume was the point of your interviewer (but please clarify my assumption).
Perhaps your interviewer was instead proposing that threads in Python work like a charm?
1. There is a common misconception that any threaded python code is slower than sequential. This is trivially false for IO-bound, and possibly, potentially, in some cases false for CPU-bound. Said interviewer had that very misconception.
2. More importantly - you and another parent are right. The code does not demonstrate what I thought it does. This is proven by removing the ThreadPool and leaving the rest of the code intact:
    import random
    import math
    from concurrent.futures import ThreadPoolExecutor as Pool

    items = [random.random() for _ in range(math.factorial(11))]

    def run(items, n):
        step = len(items) // n
        res = [sum(items[i*step : (i+1)*step]) for i in range(n)]
        return sum(res)

    if __name__ == '__main__':
        import timeit
        import sys
        n = sys.argv[1] if len(sys.argv) > 1 else 1
        time = timeit.timeit('run(items, %s)' % n, 'from __main__ import run, items', number=10)
        print("%s\t%.3f" % (n, time / 10))
$ for x in `seq 1 11`; do python3 -m main $x ; done
1 0.799
2 0.749
3 0.671
4 0.715
5 0.730
6 0.704
7 0.649
8 0.631
9 0.689
10 0.613
11 0.616
This is a result I'd have to silently contemplate before making any further comments.
I’m pretty sure the time “savings” you are seeing here come from somewhere else. At first glance, you’re copying the huge list while submitting it to the thread pool, and this has overhead. Maybe lots of smaller copies are faster on your machine.
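One way to test that hypothesis is to time only the slicing, with no summing and no threads:

    import random
    import timeit

    items = [random.random() for _ in range(10 ** 7)]

    def just_slice(n):
        step = len(items) // n
        return [items[i * step:(i + 1) * step] for i in range(n)]

    for n in (1, 4, 16):
        t = timeit.timeit(lambda: just_slice(n), number=10) / 10
        print(f"{n}\t{t:.3f}")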
I don't know the specifics of Python's C interfaces, but I know Ruby by default will hold the GIL across all 3rd-party extensions for safety, and you can release it upon entry into the extension using the C API. But doing so can be very dangerous if you're calling back into the Ruby core, as you could inadvertently cause problems that the GIL was meant to protect you from.
It still hits the problems, but it can still be better than single-threaded. That's the point of his comment. People say the GIL is bad so they throw the baby out with the bathwater and no longer use threads, which isn't very well-reasoned.
Been a while since I've used Python, but as far as I remember the GIL only affects Python objects. So if you use Numpy for operations, you can avoid the GIL.
And, more importantly for us, while numpy is doing an array operation, python also releases the GIL. Thus if you tell one thread to do:
print "%s %s %s %s and %s" %( ("spam",) *3 + ("eggs",) + ("spam",) )
A = B + C
print A
During the print operations and the % formatting operation, no other thread can execute. But during the A = B + C, another thread can run - and if you've written your code in a numpy style, much of the calculation will be done in a few array operations like A = B + C. Thus you can actually get a speedup from using multiple threads.
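A hedged way to measure that effect (whether you see a real speed-up depends on the operation and on memory bandwidth, but the GIL itself is released while numpy works on the arrays):

    import threading
    import time
    import numpy as np

    B = np.random.rand(20_000_000)
    C = np.random.rand(20_000_000)

    def work():
        for _ in range(10):
            np.sqrt(B * C)            # numpy releases the GIL inside these operations

    t0 = time.time()
    work()
    work()
    print("sequential:", round(time.time() - t0, 2), "s")

    t0 = time.time()
    threads = [threading.Thread(target=work) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("2 threads:  ", round(time.time() - t0, 2), "s")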
>Threads have been in the standard library for a very long time.
Yes and still run the risk of running into GIL problems. You claim to have shown 10 lines to an interviewer that proved he was wrong about the GIL still being an issue and I've yet to see an example of it not being one when using pure python. Yes there are certain cases where you don't hit the GIL, no that doesn't mean it isn't still an issue when dealing with threads.
I just meant that one way to make threads "work" is to use IO, so I assumed the OP did a trick along those lines. Basically two threads can receive data or read from disk at the same time. The same thing can be accomplished with a select/epoll setup, but threads would just be fewer lines of code.
Otherwise another trick is to identify a call which uses extensions or releases the GIL once it goes into C.
> I had a funny such experience lately in a job interview. I told the interviewer his misconception could be falsified with ~10 LOC summing a list with 2 threads.
I had multiple of those experiences. Even then people wouldn’t believe me so we had to look at Python C code and read about threads and IO operations in C and such. The other alternative is to write a deliberately slow socket server that takes say 5 seconds to respond then then write a threaded client to issue requests and seeing how they complete in parallel.
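A rough sketch of that demo using nothing beyond the standard library: a deliberately slow local server plus a threaded client, where four five-second calls finish in about five seconds rather than twenty.

    import socket
    import socketserver
    import threading
    import time

    class SlowHandler(socketserver.BaseRequestHandler):
        def handle(self):
            time.sleep(5)                          # pretend to be a slow backend
            self.request.sendall(b"done\n")

    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SlowHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()

    def call_server(i):
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.recv(16)                             # blocks; the GIL is released while waiting

    t0 = time.time()
    clients = [threading.Thread(target=call_server, args=(i,)) for i in range(4)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    print(f"4 calls to a 5-second server finished in {time.time() - t0:.1f}s")
    server.shutdown()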
Exactly, and I’m surprised none of the other comments mentioned this so far. Often web app endpoints are bottlenecked by IO because they’re spending most of their time talking to a database or cache server. Python is probably not the right tool for a CPU intensive endpoint that needs to serve up hundreds of thousands of requests per minute and can’t be cached.
There are a lot of ways to handle the IO intensive scenario in python:
* Threading - Works with python libraries written in C, but now you need to add locks to your code to prevent race conditions. Not good for CPU-heavy work because the interpreter keeps switching context between threads (every ~100 bytecode instructions in older CPython, every few milliseconds in newer versions).
* Gevent - Requires minimal code changes, but it usually blocks when running code from python libraries written in C. This uses event loop style concurrency and automatically patches internal python libraries to switch context when IO occurs. This means less time spent unnecessarily switching context, so it can scale better than python’s threading.
* multiprocessing - Better for CPU intensive work, but requires more memory than other solutions. You don’t need to worry about race conditions since the processes are separate.
* asyncio - Requires code changes and using compatible libraries. It does event-loop style concurrency that allows you to specify exactly when to switch context. Like with gevent, this means less time spent unnecessarily switching context (a minimal sketch follows below).
* ...And there are probably a bunch of other ways I’m missing.
I’ve heard that this sort of stuff is simpler in Go.
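For completeness, a minimal asyncio sketch of the IO-bound case (asyncio.sleep stands in for awaiting a database or cache call):

    import asyncio
    import time

    async def handle_request(i):
        await asyncio.sleep(1)        # "waiting on the database"
        return i

    async def main():
        t0 = time.time()
        results = await asyncio.gather(*(handle_request(i) for i in range(100)))
        print(len(results), "requests in", round(time.time() - t0, 2), "s")   # ~1s, not ~100s

    asyncio.run(main())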
From what I’ve seen in production environments, Python has an uncanny ability to take tasks that should be IO- / server- bound and make them CPU-bound.
For sure, but often that’s a sign that you’re doing it wrong. Maybe those aggregations should be done in the database, maybe it should be cached, maybe you should do that in bulk with one request, maybe it should run in the background anyway, etc.
Sometimes the language is the reason it is IO bound. A good example of this and discussion is in the link below. The heap of abstractions can cause it to be IO bound, or the CPU usage can bound the IO.
Percentages are misleading, what matters is the actual time it takes. Even small gains will have an effect as you increase the traffic and add additional stages to the system. Small delays accumulate and can have a surprising effect on queue sizes and latency.
I think you misunderstand the training of many types of large statistical models. It’s almost always I/O bound (getting batches of data into memory or shipped to GPU) and the CPU bound part is intensely optimized SIMD matrix algebra operations.
The hard part is always that you have more data than can fit into memory, and need complex prefetch solutions to parallelize getting the next batch of data while one batch’s numerical linear algebra computation is in progress.
Many libraries handle this transparently for the user in an extension module, such as Keras & PyTorch.
I think it’s misleading to act incredulous that this is the bottleneck ... it doesn’t have anything to do with Python, GIL, etc. (and there are easy-to-maintain solutions for this in Python).
Years ago I had someone on reddit arguing with me about whether or not the GIL existed and affected Matz Ruby (this was 10+ years ago).
This person was actively arguing with me about what it was. It's one of many battle stories for why I pretty much just assume everyone on reddit is stupid unless shown otherwise.
You can choose a language that optimizes your hardware, or you can choose a language that optimizes your programmers. 99% of the time optimizing the programmers is the right call.
Yes, if by "optimizing programmers" you mean "optimizing the manager's corporate structure footprint and bonus incentives".
If the choice is between hiring one good C++ programmer or 15 really dumb Python "backend engineers" (and a team of QA's and sysadmins to support them), what do you think your pointy-haired corporate boss would choose?
This is a false dichotomy. You don't have to choose between 1 C++ developer and 15+ package of Python developers.
Personal productivity is generally going to be lower in C++ than it is in Python. In most situations you would get more productivity out of a similarly skilled Python dev than out of a C++ dev, so you'd probably need to hire more C++ developers.
Unless your argument is "C++ devs are smart and Python devs are dumb", in which case, let's not start calling people names over the language they use.
> Personal productivity is generally going to be slower in C++ than it is in Python.
No, it entirely depends on the skill and experience level of the programmer.
> In most situations you would get more productivity out of a similarly skilled Python dev than out of a C++ dev
No, a good programmer with equally good knowledge of both languages will code equally fast in both.
> This is a false dichotomy. You don't have to choose between 1 C++ developer and 15+ package of Python developers.
You failed to see the point. Large teams of clueless programmers doing things slowly and badly is a feature of the system, not a bug. KPI's for managers don't include lowering headcount and cost cutting as an incentive. (And trust me, you really wouldn't like it if they did.)
> Unless your argument is "C++ devs are smart and Python devs are dumb", in which case, let's not start calling people names over the language they use.
Yet that's effectively what you just did in your specious 'productivity' argument.
>No, it entirely depends on the skill and experience level of the programmer.
We're comparing programming languages. You control for the other variables - otherwise the comparison is meaningless. My argument is equally skilled programmers will generally be more productive in Python than C++.
>No, a good programmer with equally good knowledge of both languages will code equally fast in both.
Care to explain how? Huge amounts of the features in these higher level languages are explicitly to increase productivity, and it's generally pretty well accepted they're successful. In this very comment section there's multiple sets of people talking about having to implement their own reference counting systems, and all sorts of other things. Implementing those systems takes up productivity.
If any one language was the best at everything, we would only have one language. There's trade offs made, and that's why there frequently is a right language (or set of languages) for one job vs. another.
>You failed to see the point. Large teams of clueless programmers doing things slowly and badly is a feature of the system, not a bug. KPI's for managers don't include lowering headcount and cost cutting as an incentive. (And trust me, you really wouldn't like it if they did.)
This simply hasn't been the case anywhere I've worked at for a meaningful amount of time. Empire building was plenty discouraged at all of the places I have been employed long term and there were KPIs for reducing headcount, or maintaining it while taking on additional responsibility, and accounting was always happy to step in if costs were increasing without solid justification. Reducing headcount doesn't even have to mean firing people or managing them out - it can be not backfilling spots, helping people find teams with open headcount and transferring, etc. It's never bothered me any - if we have too many people for the amount of work, I'm more likely to get bored.
>Yet that's effectively what you just did in your specious 'productivity' argument.
You started this whole tangential discussion by taking shots at Python developers for no apparent reason. I'm not calling anyone names, or anything even remotely similar - different languages are frequently differently suited. There's obviously places where C++ makes a lot of sense.
I don't really have a horse in this race - right this moment, I'd do basically any serious project in rust, and cargo-script has me even doing small scripts in rust as well. Maybe Erlang or Elixir if OTP makes a lot of sense for the project.
You have not had to interview the python devs I've had to interview.
Last round, for the second-best candidate, whom we hired at 130% of the salary we initially offered, after the best candidate was snatched from under our nose for what we were told was twice the salary we were offering:
>"Last question, I saw that you wrote a 100 line function here, is this because you ran out of time?"
>>"No. I don't like to break up my functions. Having many small functions is confusing and bad practice."
Opting for a performant but not scalable solution is basically an acknowledgement that:
- The project will only succeed up to a certain point.
- After the initial launch, no further changes will be made to the project since new additions are likely to cost performance and lower the upper bound on how many users the system can support.
Not many projects are willing to accept either of these premises. Nobody wants to set an upper bound on their success.
Also, it should be noted that no language is 'much faster' than any other language. Benchmarks which compare the basic operations across multiple languages rarely find more than 50% performance difference. The more significant performance differences are usually caused by implementation differences in more advanced operations and in any given language, different libraries can offer different performance characteristics so it's not fair to say that a language is slow because some of its default advanced operations are slow.
Usually, performance problems come down to people not choosing the best abstract data type for the problem. Some kinds of problems would perform better with linked list, others perform better with arrays or maps or binary trees.
Time complexity of any given algorithm is way more significant than the baseline performance of the underlying language.
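A small illustration of that point, where the only difference is the choice of data structure:

    import timeit

    needles = list(range(0, 100_000, 7))
    haystack_list = list(range(100_000))
    haystack_set = set(haystack_list)

    t_list = timeit.timeit(lambda: [n in haystack_list for n in needles], number=10)
    t_set = timeit.timeit(lambda: [n in haystack_set for n in needles], number=10)
    print(f"list: {t_list:.3f}s  set: {t_set:.3f}s")   # O(n) scans vs O(1) lookups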
the title is misleading. sure, it advocates for C++, or Java, or C#, or others above Python, but clarifying the context: "a lot (not all!) of complex “scalable” systems can be done with a simple, single C++ server.", which I take as: "sometimes it's better to write a simple server in a low level language, than a complex server in a higher level language".
people will bike to work to "save the environment" but won't use C direct on metal to reduce the carbon footprint of their code.
For a guy like Carmack, it may be quite frustrating working within the constraints of PyTorch etc. He'll probably end up making his own PyTorch frontend in C, which as a bonus people will use to deploy models.
I don’t think it’s that big of a language issue. Java isn’t that much slower (logarithmically speaking). But design choices have arbitrarily incorporated a huge amount of bloat and inefficiency.
yep, logarithmic. Well, the question is, what's the motive behind those design choices? Today's e.g. web app choices seem to be basically aesthetic, with an eye for extreme novelty and total disregard for how many times a buffer is being copied back and forth before reaching the user.
That is compared to Python/Ruby, which would be multiple logs slower, or to scaling choices that make the service resemble microkernels. Usually dedicated, simple engineering can be very fast using either C or Java.
Hypocrisy is not such a rare animal. There is however extra factor here: biking to work if possible is your personal choice. Using your favorite tools at work: often not as much
probably not so much hypocrisy as lack of consideration. Even though the energy benefit from making apps e.g. twice as fast isn't so big, it's worth giving a try for the time savings.
Keep in mind a modern commodity x86 server with 128 physical cores, 4 TB of RAM, a decent amount of SSD storage, and dual 100 GbE NICs is about $70K.
Ability to use something like Rust also changes equation significantly.
I believe the key is being able to use something _native_, not interpreted code or bytecode running on a virtual machine, which runs in a userspace process inside a virtualized server inside a bare-metal server.
I think it’s a valid point that the reasons why C++ is good for perf - compiled, statically typed, low-level, manual memory management, high degree of control over memory layout - would also apply to Rust. If they don’t then that would really call into question positive claims being made about Rust.
I was curious mainly because I remember some long comments he made about the relative value of linting and static analysis tools, specifically going into Coverity's analysis of the Doom 3 code (I think), fixing everything it flagged for some subset of the code, and then asking himself whether it was really better or more of a hindrance that obfuscated code.
IIRC, his conclusion was mixed: a lot of it was obviously beneficial and worth having turned on, but much of it wasn't, and going forward he intended to make it a limited but continual part of his toolchain.
So my interest in whether he'd tried Rust was whether he'd compared Rust's changes like the borrow checker against his earlier conclusions on writing good C/C++. Cute that he's tried and liked it, but I'd really like to see a more in-depth comparison from him.
Right. He’s got some kind of notorious programming style with who knows what kinds of object graphs. It would be quite a data point to know if he thinks that he can comfortably map it to Rust.
Morals of this story:
1. Always use the most performant language available to you. (What if your program gets a few million users?)
2. Horizontal scalability is too much complexity/work. Just apply the correct amount of optimization when you initially write the code.
If you, like me, read this on mobile and did not click [more answers] link under that, then do it. That may save you a minute or two of derealization time.
I really had a hard time deciding whether I agree with that statement or not. If you design a whole service, it is obviously not true. But if you develop something like a specialized backend, say a database, you might want to reconsider the complexity of horizontal scalability.
#2 is almost no longer true, because there are simple templates you can follow that will scale you to infinity in the cloud (which also means your cost is going to scale too, but hey, we saved engineering time on scaling).
This has been a repeated point since "enterprise" Java (a.k.a. Jabba) became a thing in the late 90s and early 2000s. A ton of enterprise code is comically inefficient and held together with scotch tape and used chewing gum.
It ends up boiling down to the fact that compute power is a lot cheaper than developer time and really good developers are more expensive and harder to find than inexperienced ones.
From this thread, I feel a lot of people misunderstand what a scaleable system means for a scale-up startup or BigTechCo. It doesn’t mean cost-efficiency. It means the ability to solve scalability issues at 10x, 100x, 1000x load by throwing money at it (aka buying more instances/machines).
Yes, John Carmack is right that a bunch of scaleable systems that see reasonable load, with well-understood traffic requirements, could be rewritten in more efficient ways. But how long would it take? How much would the extra work cost? And what would happen if traffic goes up another 10x, 100x? Can you still throw money at that problem, with this new and efficient system already in place?
One of my memorable stories comes from an engineer I worked with when we rewrote our payments system[1]. He told me how at a previous payments company, they had a system that was written in this kind of efficient way, and needed to run on one machine. As the scale went up, the company kept buying bigger hardware, at one point buying the largest commercially available mainframe (we’re talking the cost being in the multi-millions for the hardware). But the growth was faster and they couldn’t keep up. Downtime after downtime followed at peak times, making hundreds of thousands of losses per downtime.
They split into two teams. Team #1 kept making performance tweaks on the existing hardware to try to get performance wins and increase reliability during peak load. Team #2 rewrote their system in a horizontally scaleable way. Team #1 struggled and the outages kept getting worse. Team #2 delivered a new system quickly and the company transitioned their system over.
What they gained was not cost savings: it was the ability to (finally!) throw money at their growth problems. Now, when traffic went up, they could commission new machines and scale with traffic. And they could eventually throw away their mainframe.
Years later, they started to optimise the performance of the system, making millions of $ in savings. The new system - like the old one - cost millions to run. But that was beside the point. Finally, they stopped bleeding tens of millions in lost revenue per year due to their inability to handle sudden, high load.
It’s all about trade-offs. Is cost your #1 priority, and are you okay spending a lot of development time on optimising your system? Go with C++, Erlang or some other, similarly efficient language. Is product-market fit more important, along with uptime you can simply buy more of? Use the classic, horizontally scaleable distributed systems stack, and worry about optimising later, when you have stable traffic and optimisation is more profitable than further product development.
While that’s true, this is the same issue I have with Hadoop clusters which can easily be replaced by a single server and some grepping (well almost;))
Most companies or startups or even bigtech are not big enough for these problems.
All these complexities are horrible, but good for the AMZN stock price, as devs keep spooning up a ton of VMs
What about in the context of trying to get to an MVP? Is the dev-time speedup of using a dynamic programming language and stack significant over using a C++ backend? You wouldn't care much about performance when you're trying to figure out if you'll get traction.
No, it's not significant. Dev time depends on programmer skill, not the toolset. A good C++ programmer will develop your MVP many times faster than an average Python programmer.
Python programmers are much easier to hire, though - you already need a good C++ programmer on the team to hire another one, because HR and corporate management can't run a proper hiring process for that on their own.
This last factor is the overarching most important one for BigCorp Enterprise Inc., not development speed or cost.
> Dev time depends on programmer skill, not the toolset.
This is obviously not strictly true, always. A skilled programmer will use the proper tools for the job.
If you, for example, are tasked with writing a backend service exposing a GraphQL API, I think it would be foolish to do this in C++, and I would bet that the average Python programmer would do it quicker than even a top-tier C++ programmer (if the latter were hellbent on doing it in C++).
Especially when working with MVPs (or new projects in general), the ability to leverage already existing tools and frameworks is key to rapid progress. This doesn't necessarily have to be scripting languages, but the Python/Node/Go/etc. developer would have a working GraphQL server up and running, connected to a database of choice, within an afternoon, while the skilled C++ developer would have to spend at least a few days implementing a GraphQL server mostly from scratch [0].
[0]: A quick Google show that schema parsers exists for C++, but nothing matching the frameworks/library available for more web-fashionable languages.
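For what it's worth, a minimal sketch of the Python side using the graphene library (graphene 3.x style; the schema and resolver names are illustrative, and serving it over HTTP would still need a web-framework integration):

    import graphene

    class Query(graphene.ObjectType):
        hello = graphene.String(name=graphene.String(default_value="world"))

        def resolve_hello(root, info, name):
            return f"Hello {name}"

    schema = graphene.Schema(query=Query)
    result = schema.execute('{ hello(name: "GraphQL") }')
    print(result.data)   # {'hello': 'Hello GraphQL'}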
a) Writing a schema parser is not rocket science. In fact, for a good programmer implementing their own GraphQL library would be quicker than integrating some third-party library. So your first point ("average Python programmer would do it quicker than even a top-tier C++ programmer") is absolutely wrong.
b) There's no value in an MVP that does something generic that is already available in off-the-shelf libraries. Your GraphQL example is pretty pointless because it doesn't actually do anything.
> ...the Python/Node/Go/etc developer would have a working GraphQL server up and running connected to a database of choice within an afternoon
Well, no. By the end of the week they'll still be arguing about which package manager to use, whether TDD is a good idea, what makes a microservice 'micro' and how to configure Kubernetes.
Not really; hiring Python and Java devs is very amenable to keyword-driven recruitment. Hiring C++ devs in the same manner is a clusterfuck waiting to happen.
I have watched the OCaml community go through a renaissance when for a while it looked like it was moribund, and the D community looks like it is developing the same undercurrent of momentum. Given that D is a great language and the implementation looks solid, I don't think it is in danger of dying.
The analogy here is one of the best ice climbers in the world proposing an ascent of the Matterhorn. Please use testable, easy to prototype with, memory managed languages for production servers unless you are solving a very specific problem and really know what you are doing.
Very apt. I think people are also forgetting “the bad old days” where you worked on a large-ish cpp or java app for a year, nothing worked right, schedules slipped, and then the whole thing was scrapped and teams disbanded to work on other stuff. That was very common. You can’t count on having a team of Carmacks work on your blub app.
> ...My bias is that a lot (not all!) of complex “scalable” systems can be done with a simple, single C++ server.
The second tweet of the discussion:
> JAVA or C# would also be close, and there are good reasons to prefer those over C++ for servers. Many other languages would also be up there, the contrast is with Really Slow (but often productive and fun) languages.
I'm afraid that Carmack has sided with his own anecdotal experiences of the 1990s to justify the use of C++ for server-side development in the 21st century. This probably made sense at the time due to the availability of more C++ devs and fewer language choices, but today in the 2020s? I remain totally unconvinced by his argument.
He goes on to suggest Java or C#, which still make sense for many companies for generic server-side development if you are after a more secure backend; Kotlin is pretty much the most sensible choice for this. Given Carmack's engineering background, however, it is unsurprising that Java/C#/Kotlin are technically unsuitable for high-performance gaming platforms, if one were to create one. So what credible languages could be used to compete with C++? I hear Discord is having a great time using Elixir (Erlang could also be used), and another gaming platform called 'Hadean' is using Rust for their platform.
"But for the sake of Carmack's engineering background however, it is unsurprising why Java/C#/Kotlin are technically unsuitable for high-performance gaming platforms if one was to create one."
This really is just not true.
Financial, real-time style High Frequency Trading apps are often written in Java - not C++.
Much of the JVM is not a VM, it compiles to machine code - in an optimised manner. For starters.
Given how difficult it is to develop safely in C++ I can hardly think of a reason to ever use it on the backend.
Trading apps generally process a small amount of data. Graphs are downright lightweight compared to what a 3D game pumps through.
Generally, for a 3D game manual memory management and explicit data layout are critical. For example, it's common to use custom memory allocators with a region for each frame, a region for each loaded level, etc... This is then much cheaper to simply drop on the floor than any kind of object-by-object cleanup, whether that is reference counting, garbage collecting, a traditional heap, or whatever. Even Rust can't yet compete with this!
Similarly, many game engines use ECS systems or in-memory columnar data layouts (structure of arrays instead of arrays of structures) to enable SIMD instruction sets such as AVX.
Java can be coerced into doing much of the above, but it generally takes a ridiculous effort to approach what comes nearly effortlessly with a language like C++ or Rust.
Even C# is a better choice than Java, as it monomorphises more code and recently had a range of extensions[1] added to reduce GC pressure such as stackalloc, Span, Memory, MemoryPool, SequenceReader, ValueTask, etc...
I've bought and played several games written in C#, but other than Minecraft I'm not aware of any popular real-time games written in Java in the last 15 years or so. Meanwhile, Minecraft is not at all smooth on my very high end gaming PC in 2019 despite being 8 years old. (I'm sure this can be eliminated by tweaking some settings, but it's indicative of the problem.)
I say this as someone who used to professionally develop browser-based Java games back in the early 2000s and had to personally jump through hoops to reduce heap allocations to avoid GC pauses.
Yes, thanks for that, but I think the author was referring to the game sync server, not actual games. Syncing minimal state data can use small data structures etc. But thanks though, great comment.
If anything the server-side coding is harder. Many games do the full physics/simulation on the server to minimise cheating, and have to simulate from every player's perspective. Meanwhile the clients have a single perspective and most of the computation effort is offloaded to the GPU.
Additionally, most multiplayer games have the same codebase for the server and the client for the obvious reasons. Single player is literally "online play" with an in-memory channel to a local server.
All Quake-based games work this way, including the derivative Source-based games and a bunch of other engine variants. Unreal works this way too if I remember correctly.
You can't realistically write a client in C++ and a server in Java. You'd be practically doubling your development time!
> Financial, real-time style High Frequency Trading apps are often written in Java - not C++.
They're written in a very specific style of Java, where you take pains not to allocate, which looks a lot more like C++ than idiomatic Java. Operationally, strange rituals are followed to avoid recompilation and other otherwise-normal runtime activity during trading hours. The app will be started on Sunday night, and the team spends the week praying it doesn't need restarting during the week.
This is way less hardware than most people in the trade (from web devs to devops) seem to think when asked about it.
SO ranks #36 in Alexa right now: https://www.alexa.com/siteinfo/stackoverflow.com