One thing to keep in mind is that their workload is very read-heavy, which eases things a lot when scaling the system. The same is true for Wikipedia. Scaling a write-heavy workload is far more complex than scaling a read-heavy one.
SO content changes all the time. Votes, comments, moderation, edits, tagging, search and recommendations, etc. There are also real-time community features. It's not as simple as it seems.
They log every single pageview to SQL Server. There are plenty of writes to various other counters, recommendation system, message inboxes, analytics, etc. Also the ads system, although I'm not sure how much of that is still 3rd-party.
Yes it's read-heavy but there's still plenty of work done in assembling a page. It's definitely not as simple as caching at the CDN edge for every hit.
IIRC most of the "real time" stuff is done on Redis, for which this QPS load is usually a joke (depending on requests of course, but counters etc are easy)
So, would you mind naming any non-FAANG but write-heavy site? Besides that, even the operation of most social networks should be fairly easy to split into town-sized instances with batch updates between instances (see the Mastodon specs https://github.com/tootsuite/documentation/blob/master/Runni...). What actually costs a lot is constantly surveilling all user interactions, warehousing that data, and playing out ads. None of that is necessary, and 20ct per user per month should cover the actual service nicely...
All that can be implemented as separate services. Doing that also enables graceful degradation, as a failure of, say, voting doesn't prevent me from reading the question.
A slim subset of humans typing things is not what I would label "write heavy". Write heavy is more like 100k+ devices out in the field sending their current position every 10s. That's still very manageable, but requires some thoughtful design.
Indeed, and I literally was. And the back end was written in mod_perl (which isn’t exactly known for performance). Even back then 10k/s was our smallest service - both in income and in traffic - so there wasn’t the incentive to rewrite it in a faster language.
The amount of data transferred is meaningless to compare. SO is a relatively complex site with dynamic content and real-time features.
Compared to similar sites like Reddit or Quora, which are far slower yet run on more hardware, it shows what properly efficient architecture and code can do.
Sure, Reddit has more write activity, but I don't see the number of replies being such a big factor. SO questions have answers, comments, tags, related questions, and lots of other secondary data to load. Reddit is slow due to poor architecture and a terrible frontend.
I recall that back when the new Reddit was rolled out, there were quite a few days when it was difficult to open in a mobile browser. Now it's much better.
SO is really not a great example of a high traffic site. 5500 req/sec of mostly read only traffic is not that crazy at all, and their hardware footprint is incredibly over-provisioned for the workload. I don't really think their example stands well here.
For example, at work, our entire analytics ingest workload (HTTP) for a few hundred million users runs on 8 core VMs on GCP, written in Rust/Go, each node doing ~40k events/second.
Our RTC infra sustains >1M PPS per node on 4-core 3.9GHz 2014-2017 Xeons.
Hi,
I'm curious because I plan to rewrite a Rust service in Go (development velocity is too slow).
Which part of your service is in Rust and which in Go?
Do you think that if it would be only in Go it could sustain such a load?
What specific ergonomics pitfalls did you experience in Rust? "Development velocity is too slow" is a bit vague; beyond the use of GC, which only really matters in specialized domains, there's not much reason to think that rewriting your service in Go would give you better 'development velocity'.
There is, though. Go was designed to be easier to reason with than conventional ALGOL-derivatives, while Rust wasn't.
To quote Rob Pike:
The key point here is our programmers are Googlers, they’re not researchers. They’re typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They’re not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt.
Go is aiming for the same niche that Python is. This has the accidental side effect of making it faster to write software in it than most of the ALGOL-derivatives. To quote Eric Raymond (sorry) on Python:
When you're writing working code nearly as fast as you can type and your misstep rate is near zero, it generally means you've achieved mastery of the language. But that didn't make sense, because it was still day one and I was regularly pausing to look up new language and library features!
Go and Python both allow for expression about as fast as you can type, by virtue of being designed for [children in Go's case, shell scripting in Python's case].
It's not a value judgement of Rust or anything, but Rust wasn't designed with the same goals in mind.
> Go was designed to be easier to reason with than conventional ALGOL-derivatives, while Rust wasn't.
This may be a problem of what's idiomatic in each language, as opposed to a matter of language design per se. After all, Rust development can be made at least as easy as, e.g. Swift, simply by adding enough uses of .clone() and RefCell<>. Is this suboptimal? Of course, but it will still be plenty faster than Python, and perhaps even faster than Go.
Compilation time is a separate issue which apparently OP found problematic. It's being dealt with (for non-release optimized builds) via the cranelift project, which is a Rust-specific backend much like the Go compiler, with no reliance on LLVM.
> This may be a problem of what's idiomatic in each language, as opposed to a matter of language design per se.
It's absolutely a matter of language design. Again, I'm not criticizing Rust, but a language explicitly designed so that anyone can write it trivially and fast is going to be quicker for anyone to write in. Imagine making that comment about Logo instead of Go. Of course Logo is more painless to write than Rust! It's a child's language. So is Go!
Rust development can't be made as easy as Swift (and it is very unlikely that what you described would be faster than Go). Even Graydon Hoare agrees that Rust is inadequate in comparison to Swift in terms of development ease:
I'm no stranger to languages with vastly different idioms than normal (I write APL daily), but not all languages are as quick to develop in as every other, and pretending they are is silly. Rust has some innovations, and it's by no means a bad language in itself, but pretending it wins at everything under the sun (even things it's not trying to do) doesn't reflect reality or the perspective of the original author.
If you can write a ton of code without thinking very much, you're probably writing boilerplate that should have been generated from a human-level description of the problem. Your job is to only spend time writing what needs to be written.
Languages that allow you to write as fast as you think are a blessing.
"Write as fast as you think" is a far better way to program than "Write much slower than you can think."
Eric Raymond has written some pretty substantial things, and he's not as clueless (on programming, at least, the rest of his views are...no) as you're implying.
The idea that intuitive languages are the only ones you should do development in is absurd. A single line of K can do what a hundred lines of C can, and you can write the line of K substantially faster than you could write the C to match. K only has something like 50 primitives. It's simple enough that you can keep it all in your head at once, and that allows you to develop much quicker than almost any ALGOL-derivative. Taking your comment at face value, everything written must be boilerplate. Looking at reality paints a different picture.
Good languages manage complexity in a way that allow you to express complex things in simple terms. That the languages you seem to be familiar with only allow you to describe simple things in simple terms isn't something that's inherent to every programming language. I'd recommend giving APL, J, or K a try.
I'm all in favor of concise and expressive languages (even weird ones). I hate being slowed down by the language itself. But writing the first thing that comes into my head leads me to reinventing the wheel a lot, and (at least at work) we have a responsibility to find reusable abstractions and only create new code that's needed.
I disagree with that. If your code base is small enough, a few redundant lines (idioms) don't matter.
To bring up k again, the language has already done about the maximum amount of abstraction possible. There's really no room for the programmer to make reusable abstractions; any useful ones have already been made. That allows you to do very useful things in very small amounts of code. Picking a random example, here's a complete Sudoku solver in 75 bytes:
By development velocity I mean implementing new features, from idea to deployment.
Rust is slow to compile so it breaks my deep work when programming, also it costs me a lot in CI/CD.
Also, the Rust type system makes implementing some things really hard.
For example, I wanted to implement JSON request logging. It took me more than a day in Rust, and less than two hours in Go.
Any reason for such a low (<5%) average CPU usage? It seems like a waste of resources to me; that's assuming a "normal" CPU usage reading that includes I/O wait time.
Stack Overflow hosts all (most?) of their own bare-metal servers in their own data center.
Looking at the specs of the machines, they are actually pretty basic as far as servers go. A server is barely worth the cost of its chassis and motherboard if you put less than 64 GB of RAM and 24 CPU cores in it.
In other words, these are about the lowest-spec'd proper servers you can get. So yeah, even their modest hardware is still over-spec'd for running their website.
"Their own data center" implies (to me at least) that they built their own data center. That doesn't sound right, so I looked it up, and it seems like they're colocating. That might be what you meant, but the data centers they use certainly weren't built or owned by SO.
As CPU utilization goes up, latency will go up. Also, one or two second snapshots of CPU use are averages over time. During any given moment you will invariably get little bursts of requests, and those will have slow responses if running closer to full utilization.
Also, of course, the total cost per hour for 9 servers with 3 year depreciation is in cents.
They have a lot more CPU than they need, but the challenge is often getting servers shaped appropriately (RAM x CPU x disk x network), especially if you're not building your own.
Ultimately, their service is (most likely) not compute-bound.
Yes but the overall hardware is a one-time purchase and cheap considering all the other costs of the business. They're well provisioned to keep latency down and handle any unexpected outages.
600k open WebSocket connections means that 600k people opened Stack Overflow pages but don't click on anything, right?
Because they still only have 550 req/s. Interesting how much more power you need just to keep track of state.
Many developers severely underestimate how much workload can be served by a single modern server and high-quality C++ systems code. I've scaled distributed workloads 10x by moving them to a single server and a different software architecture more suited for scale-up, dramatically reducing system complexity as a bonus. The number of compute workloads I see that actually need scale-out is vanishingly small even in industries known for their data intensity. You can often serve millions of requests per second from a single server even when most of your data model resides on disk.
We've become so accustomed to extremely inefficient software systems that we've lost all perspective on what is possible.
Can you expand on this? I have some pretty massive compute loads that need to be scaled onto a cluster with 100+ workers for most computations. This is after I use a library called dask that does its own graph-based map-reduce optimisation inside its modules. This is all for a relatively small 250GB raw data file that I keep in a CSV (and need to convert to SQL at some point).
Are you saying this can be optimised to fit inside a single 10 core server in terms of compute loads?
Don't know why you're being downvoted but I'll assume your question is genuine.
You use a cluster when your data and compute requirements are large and parallel enough that it's worth paying the network-latency tax and giving up the 10-20x speedup you get from SSD and the 1000x speedup you get from just keeping data in RAM.
250 gigs is tiny enough that you could probably get much better performance running on a high-memory instance in AWS or GCP. You'll generally have to write your own multiprocessing code, though, which is fairly simple - your existing library may also be able to support it.
I once actually ran this kind of workload on just my laptop using a compiled language that performed better than pyspark on a cluster.
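For what it's worth, a rough sketch of the "write your own multiprocessing code" route, assuming the big CSV has first been split into smaller per-piece files (the paths and the per-file computation here are placeholders, not anything from the thread):

```python
from glob import glob
from multiprocessing import Pool

import pandas as pd


def process_one(path):
    # Stand-in per-file computation; the real work would go here.
    df = pd.read_csv(path)
    return path, len(df)


if __name__ == "__main__":
    # Hypothetical pre-split pieces of the 250GB file, one CSV per piece.
    files = sorted(glob("parts/*.csv"))
    with Pool(processes=10) as pool:  # roughly one worker per core
        for path, n_rows in pool.imap_unordered(process_one, files):
            print(path, n_rows)
```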
I'd love to keep it in RAM if I could. The problem is, the library I'm familiar with (pandas) typically seems to take more memory than the original CSV file once it loads it into memory. I know this is due to bad data types, but in certain cases I cannot get around those.
However, even if I could load it all into memory at once, and assuming it takes 200 GB, I'm still using a master's student's access to a cluster. So I get preempted like it's nobody's business. Hence I prefer a smaller memory footprint, even if I take up CPUs at variable rates through a single execution.
I did try to write my own multiprocessing code for this, but the operations are sometimes too complicated (like groupby) for me to rewrite everything from the ground up. If I'm not reliant on serial data communication between processes (like you'd need to sort a column), I can get it done pretty easily. In fact, I wrote my data cleaning code with this and cleaned up the entire file in half an hour because single chunks didn't rely on others.
However, if you have some idea of how to run these computational loads in parallel in python or any other language on single compute instances (like the size of a laptop's memory of 16 gb), I'd really love to see it. Thanks.
Numpy supports memory mapping `ndarrays` which can back a DataFrame in pandas. This lets you access a dataset far larger than will fit in RAM as if it lived in RAM. Provided it's on fast SSD storage you'll have speedy access to the data and can process huge chunks at once.
Can you provide a link to this please? My current knowledge is that all numpy data lives in memory, and pandas itself has a feature to fragment any data into iterables so I can read up to my memory limit. I cannot use this feature due to the serial nature of some of the operations I alluded to (I'd have to almost rewrite the entire library for some of the more complicated operations like groupby and sorting).
I do have fast SSD storage because it's on the scratch drive of a cluster and from what I've seen it can do ~300-400 MB/s easily. I haven't had a chance to test more than that since I'm mostly memory constrained in much of my testing.
My current attempt is to push this data into a pure database system like SQL so that I can query it. But like I said, I work with a less-than-stellar set of tools, and I have to literally set up a Postgres server from the ground up to write to it. Which shouldn't be a big deal, except it's on a non-root user and I have to keep remapping dependencies (it took 5-6 hours to set up on the instance I have access to).
My other option was to write the entire 250 GB to an SQLite database using the SQLAlchemy library in Python, but that seems to fail whether I do it with parallel writes or serial writes. In both cases, it fails after I create ~64-70 tables.
You can create memory mapped ndarrays, these act like normal numpy arrays but don't need to fit into RAM. Numpy maps the array to a binary file on disk. The array otherwise acts like an ndarray so you can build a DataFrame with it. Whenever you access an array index Numpy in the background (essentially) seeks that many values into the file to grab the value of that index.
Since you're on a fast SSD and Numpy is fairly smart you'll be able to access your arrays close to your drive's speed. It's slower than if the whole database was in RAM but far faster than distributing the data over a network to a bunch of worker nodes. Memory mapped files let you have array-like access to data on disk as if it lived in RAM. When building a pandas DataFrame from a memmapped ndarray I believe you just need to set copy=False in the constructor for it to Just Work.
I don't know what your data looks like but I doubt loading it into SQLite is going to improve your performance.
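A minimal sketch of that memmap-backed approach, assuming the CSV has already been dumped once into a flat binary file of float64 values (the file name, shape, and dtype are assumptions):

```python
import numpy as np
import pandas as pd

# Assumed layout: n_rows x n_cols of float64 written contiguously to disk.
n_rows, n_cols = 100_000_000, 8
mm = np.memmap("ticks.f64", dtype="float64", mode="r", shape=(n_rows, n_cols))

# Walk the data in large slices; each slice is paged in from the SSD on
# demand, so resident memory stays around one slice at a time.
chunk = 5_000_000
col_sums = np.zeros(n_cols)
for start in range(0, n_rows, chunk):
    col_sums += np.asarray(mm[start:start + chunk]).sum(axis=0)

# A DataFrame over one slice; copy=False asks pandas not to duplicate the block.
df = pd.DataFrame(mm[:chunk], columns=[f"c{i}" for i in range(n_cols)], copy=False)
print(col_sums, df.shape)
```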
Unfortunately my data isn't all numbers. It has text too. The sparse examples in that link only show this for reading in numbers. Do you know off hand if it translates well? There is a dtype parameter, but it'll take me a few days to get back to this code, so I figured I'd check beforehand.
Your second paragraph is essentially what I want. I'm willing to wait a day for code that may run in 1 hour from memory, so time isn't entirely an issue unless it's starting to bleed into weeks. The read_csv function in pandas has a parameter called memory_map, but when I tried using it on a smaller 7GB dataset, it read the whole thing into memory (32GB instance) even when I set it to True.
SQLite is definitely not my best option here. It was the only server-less implementation I could find, so I tried to use it and it didn't work. However, a database-like implementation will be helpful because each operation I need to do requires data that satisfies certain timestamp and arithmetic conditions. I figured it'd be best to load the whole thing into a DB and query it for every operation to train my model.
I'd spin up something like an AWS r5.16xlarge node just for the processing and shut it down after use - it should cost a few tens of dollars per run or so. Of course, in some corporate environments this option may not be available to you.
> Built based on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
> cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.
... re: the OT: While it's possible to write C++ code that's really fast, it's generally inflexible, expensive to develop, and dangerous to write for devs whose experience lies in their own respective domains. Much saner to put a Python API on top and optimize that during compilation.
Unfortunately, I'm not sure what's wrong with dask, but it doesn't work properly on my cluster. I tested it on an exceedingly simple operation - find all unique values in a very big column (5 billion rows, but I know for a fact that there are only 500-502 unique values in there). With 100 workers, it still failed. Now, this is an embarrassingly parallel operation that can be implemented trivially. So I'm not sure if there's a problem with my cluster or if dask just does not work with Slurm clusters very well.
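For comparison, that particular job can also be done in a single process by scanning only the relevant column in bounded chunks; a minimal sketch, with the file and column names assumed:

```python
import pandas as pd

# Accumulate distinct values of one column without ever holding the
# 5-billion-row column in memory at once.
uniques = set()
for chunk in pd.read_csv("big.csv", usecols=["symbol"], chunksize=10_000_000):
    uniques.update(chunk["symbol"].unique())

print(len(uniques))  # expected to land around the ~500 known values
```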
> You use a cluster when your data and compute requirements are large and parallel enough that it's worth paying the network-latency tax and giving up the 10-20x speedup you get from SSD and the 1000x speedup you get from just keeping data in RAM.
Nowadays there's another reason to use clusters: to autoscale your expenditure wrt workload. A little inefficiency might be acceptable if you don't have to pay for a huge beefed up server idling at, say, 30%.
note: doing anything on a 250gb file in python will require a lot more ram than 250gb. generally my expectation is I will need 10x the ram as the size of the file when using pandas, for when I accidentally do something that triggers pathological behavior.
You have 250GB of "raw" data stored in CSV format. The parsed version of this data in memory is likely to be a fraction of the on-disk size. A `long` or `double` only take up eight bytes in memory but 10-20 bytes on disk stored as ASCII in a CSV file. Even if your raw data was 250GB you could store it in memory mapped files. A fast SSD can easily hit a gigabyte per second sequential read speed, far faster than your typical network.
Segmenting your raw data and using memory mapped files will let you work with large data sets without needing huge amounts of RAM. From there it's a question of your single system's processing speed and IO capacity. This is only necessary if your processing needs random access to the entire dataset.
If your CSV data is more like a streaming data source, you're processing each record as it's read in, you can just stream it in through `stdin`. At 1GB/s you're looking at five minutes or so to process your 250GB of raw data. A SATA SSD might take twenty minutes to stream that raw data.
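For illustration, a minimal streaming sketch of that pattern (the presence of a header row and which column to aggregate are assumptions):

```python
import csv
import sys

# Fold each record into a running aggregate as it streams in, so memory use
# stays constant regardless of file size.
# Hypothetical usage:  cat ticks.csv | python aggregate.py
reader = csv.reader(sys.stdin)
next(reader)                   # skip the header row (assumed to exist)

rows, total = 0, 0.0
for record in reader:
    rows += 1
    total += float(record[2])  # which column holds the number is an assumption

print(f"rows={rows} sum={total}")
```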
> A fast SSD can easily hit a gigabyte per second sequential read speed, far faster than your typical network.
It's important to note that often your disks aren't directly attached to your compute. That's frequently the case in (particularly cheap) cloud instances.
That's the thing, you don't necessarily need a bunch of cloud instances to process data. If you must be doing everything in "the cloud" every service has dedicated instances with fast attached storage available. You can spin up one long enough to rip through your data instead of trying to distribute it over hundreds of workers.
It's also a domain where you can buy an off-the-shelf desktop for a few hundred dollars to do the work. That's the thrust of this whole thread, because scalable "cloud" systems exist and look cheap people obsess about throwing more instances at problems.
Modern commodity systems are ridiculously powerful and far more capable than people tend to assume. Even "the cloud" gets underestimated because people look at the low end cheap instances and assume they need to spin up hundreds of those when one beefy image for a short duration could do the same work.
so, a hundred of the cheapest cloud instances cost you what? 1000 USD per month? You can easily get a deskside workstation for 10000 USD with a nice 2TB NVME-SSD, a 32-Core Threadripper and 128GB of memory... Utilization might be lower of course, but even if it's only utilized 8h/day this seems like a bargain for any solid business. For any startup aiming for exponential growth by burning through cash and having a 15month "half-life" this is not gonna work out though
> A fast SSD can easily hit a gigabyte per second sequential read speed, far faster than your typical network.
Nit: 1GB/s is ok, not even solid let alone fast.
A fast SSD pretty much saturates a PCIe 3.0 x4 link (which explains why they universally tend to cap out at 3.5GB/s). In fact there are now a few PCIe 4.0 SSDs (e.g. Corsair's MP600) which close in on 5GB/s.
I got the 1GB/s from my laptop I was writing the comment on. The internal NVME drive tops out at 1.5GB/s. I consider that fast but as you point out there's drives that make mine look slow.
Yes, but I'm heavily constrained because my access to the cluster I'm using is very low level and I get pre-empted quite a bit in my tasks. I'm probably over stretching between the amount of data I need to handle and my severe lack of skills (I'm a quant in training).
> Are you saying this can be optimised to fit inside a single 10 core server in terms of compute loads?
I'm currently employed to write software which does DNA analysis. DNA is known for being Big Data.
Some applications are very compute intensive and others are very data-relation intensive. The very compute intensive applications process about 1GB of data in about 30 minutes on a 32 core Xeon 6xxx with 32GB of RAM assigned to it. The data-relation intensive application processes about 350GB of data in 15 minutes on a single core of the same CPU but with 700GB of RAM assigned to it.
Both are heavily optimized but in different ways. So without knowing more about what you're doing with that data, it's hard to say.
So my entire dataset is ~24 x 250GB files. That 24 number can be larger if I can find an efficient way of processing each 250GB chunk. Each 250GB chunk is actually stock tick data so it has 500 stocks inside it. A heavily traded stock takes up ~10 GB of memory while a very thinly traded stock can top out at just 700 MB.
While I hope that each 250GB chunk has everything in order and I can separate it cleanly, I don't trust it. So I broke up all the files by stock with an embarrassingly parallelised bit of code I wrote and got 500 separate files. However, the problem is, I'd want to process these 500 files in parallel so each stock gets only so much memory (and hence my original constraint remains). And for each stock, I want the data to be queried on timestamps, so I need a way to quickly say "I have data at 9am on Thursday, I want data from 4pm Wednesday to 8am Thursday to create a model".
I figured the best way to do this efficiently was to have the code create a massive database for all the stocks and query it efficiently in SQL. But I'm stuck there due to a lack of tools.
> I have no doubt I could solve your problems. I honestly don't care to do so here though.
That's fair. I did not think it'd be such a difficult problem when I first set out to do it myself. But every single turn leading to a dead end kinda bummed me out. I'm going to finally resort to the database method of storing all the data in one query-able file and work off of that.
That's very interesting. Do you mind sharing some more details on what those computations are? Where can I learn more about these DNA analysis use cases?
I've found that Wikipedia has really good high-level information for a lot of the subject matter. For use cases, the business sells direct-to-consumer DNA tests. I've worked on the software for several of the analysis products.
The compute-heavy workload calculates edit distance [0] of short paired-end sequenced DNA [1] vs the human genome [2]. There is open source software to manipulate the FASTA/FASTQ [3] and SAM files [4] and run the calculations [5]. The aligned file is processed in a couple of minutes to genetic variation report [6] which is used for some of the analysis products that were purchased. One popular product will give you a haplogroup [7] which basically tells you where you are in a genetic tree.
The relationship estimator uses a different sequencing technology and basically consumes a CSV file from the sequencer's manufacturer. It uses a proprietary algorithm to calculate centimorgans [8]. That then gives relationship estimates between you and other people who've purchased the product.
Not who you asked, but, it's hard to say without knowing exactly what computation is being done, or how much of the time is spent on IO. If you organize that 250GB in ram the right way (cache coherency, right container types), and spend a lot of effort doing analysis of algorithm selection, you might be surprised how much you can get done on a single (large) machine with enough cores.
I'm doing my PhD in HPC (part of my work is in situ stuff). One of the biggest problems is actually IO. But honestly, 250GB isn't that big. I haven't used dask, but I know people that do. I just avoid python for anything I need performance for, C++ is just always going to be faster.
Biased: you might want to look into DOE libraries. For IO I suggest ADIOS2 [0]. There's python bindings too.
One of the biggest things you can do is use a different storage type like BP (ADIOS) or hdf5. These are readable but binary. But to really determine how to speed up your problem you have to know where the bottleneck is. Is it IO or compute? With 100 workers (threads or nodes?) you aren't highly parallelized. I mean that could be a single node if it's threads.
If there are floating point numbers in those csvs, I sped up a system like that 10x just by writing a custom (Java equivalent of) atof() that didn't do variable decimal separators and scientific notation. That's not even counting the improvements in I/O speed from the size reduction. Any system that works from CSV's is going to be slow. I don't know what sort of computations you're doing of course, but I did all the work on a laptop, couldn't be bothered to scale it out after the improvements I made in the first pass. How much of your code spends its time in I/O (including conversions) vs actually calculating?
Not sure which DB you are using, but you can load the csv file into the DB directly on a single thread using something like LOAD DATA INFILE.
If you have some good indexes and do some push-down work (give the database aggregation tasks to do instead of your python code), you should probably be more than fine.
For a 250GB file it should be OK; maybe add some partitioning too.
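A hedged sketch of that route, assuming a MySQL/MariaDB server and an existing `ticks` table whose columns match the CSV; the host, credentials, table, and column names are all placeholders:

```python
import pymysql

conn = pymysql.connect(host="localhost", user="quant", password="...",
                       database="market", local_infile=True)
with conn.cursor() as cur:
    # Bulk-load the CSV server-side; no per-row handling in Python.
    cur.execute("""
        LOAD DATA LOCAL INFILE '/data/ticks.csv'
        INTO TABLE ticks
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
    """)
    # Index the columns the timestamp-range queries will filter on.
    cur.execute("CREATE INDEX idx_symbol_ts ON ticks (symbol, ts)")
    # Push aggregation down to the database instead of doing it in pandas.
    cur.execute("""
        SELECT symbol, COUNT(*), AVG(price)
        FROM ticks
        WHERE ts BETWEEN %s AND %s
        GROUP BY symbol
    """, ("2020-01-08 16:00:00", "2020-01-09 08:00:00"))
    for row in cur.fetchall():
        print(row)
conn.commit()
```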
I'm open to using any DB that I can query over some engine with a Python implementation, so any SQL DB should be fine. However, I don't know how to load a CSV into an SQL database directly. Is the command you mentioned part of some SQL server package? It sounds like exactly what I need.
pandas can read from a CSV file and then write to SQL. Even if you don't go the SQL route, you'd probably gain significant benefits by working with HDF instead of CSV.
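A short sketch of that pandas route; the connection string, table, and column names are assumptions, and the chunked read keeps memory bounded:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for an already-running Postgres instance.
engine = create_engine("postgresql://quant:pw@localhost/market")

# Stream the CSV in bounded chunks so memory use never approaches the file size.
for chunk in pd.read_csv("ticks.csv", parse_dates=["ts"], chunksize=1_000_000):
    chunk.to_sql("ticks", engine, if_exists="append", index=False)
```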
Need to understand your domain better, but in many cases, the 250GB csv can be compressed down quite effectively using a columnar representation. And the columns can (potentially) be processed using simd/gpu based approaches to where a single server would outrun a cluster. Food for thought..
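As one possible illustration of the columnar idea, a sketch that converts the CSV into compressed Parquet pieces (needs pyarrow or fastparquet installed; the column names and dtypes are assumptions about the tick data) so that later passes can read back only the columns they need:

```python
import pandas as pd

# One-off conversion: chunked CSV read, columnar compressed output.
dtypes = {"symbol": "category", "price": "float64", "size": "int32"}
chunks = pd.read_csv("ticks.csv", dtype=dtypes, parse_dates=["ts"],
                     chunksize=5_000_000)
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"ticks_{i:04d}.parquet")

# Column pruning: pull just the two columns a computation needs.
prices = pd.read_parquet("ticks_0000.parquet", columns=["ts", "price"])
```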
I used to think that. But there are a couple of notable exceptions at either end of the latency spectrum.
If your latency requirements are slack, then you can get away with one machine, because you can reboot or reprovision it and carry on processing without missing your requirements.
If your latency requirements are tight, you don't have time to fail over anyway, so you might as well run one machine and make sure you can deal with failing to meet your requirements.
This is why I ran Redis on a single server with no failover strategy. Our team spent maybe 15 hours workshopping a handful of other things (redis cluster, sentinels + replicas) before realizing we could spend an hour to have everything sit in a degraded state working around it while Redis itself was fixed. Redis only failed once ever and it took all of an hour to fix it, all during non-peak hours.
But isn't the containerization trend leading us to a completely opposite direction ie. scale-out by default and do that not only because of performance but mainly because of how you want to manage your production environment?
This is precisely the point made by McSherry, Isard and Murray in their lovely paper, "Scalability! But at what COST?" (Usenix HotOS '15). They demonstrate how much performance headroom there is in modern CPU and memory, and show how simple cache-sensitive batch algorithms running on a single core can outperform hundreds of cores running distributed map-reduce style jobs.
This is true in a "water is wet" kind of way. The point is that a great many problems that can fit neatly into a single machine are being turned into I/O problems by being distributed onto clusters.
There's an incredible number of gigabyte and even terabyte scale problems that are consuming racks of blades when a little thinking and understanding of the problem being solved can be done pretty nicely on far fewer resources.
What's really happening is that many people think it's "easier" to simply rack more equipment into the cluster, and they end up shifting the complexity into cluster administration rather than programmer time.
> I’m a C++ veteran btw and understand the point but big data is about how to process petabytes of I/O not how to consume CPU.
I'm not sure this is cut-and-dry. Back when I was working on Spark workloads, there was some interesting research being done on where the bottlenecks were for jobs. I think it turned out for a lot of jobs, infinite disk / network io didn't give as much of an improvement as you'd expect.
Yes, you are right. That said, the COST paper's presumption is that not every "big data" problem is big (petabyte size), that most probably fit in a single system's disk & memory.
Not for everyone. In finance, a query might only use gigabytes or terabytes of data, but need to do a ton of simulation and calculation on top of that. Optimization of e.g. trading algorithms is entirely CPU-bound.
Even CPU is about "I/O" these days. Memory (RAM) is the new disk - and memory bandwidth is generally the performance bottleneck in heavy workloads, especially in multicore. This might be one reason why loosely 'C-like' languages like Rust are going back in style. High-level languages are terrible for memory bandwidth.
No doubt there are people who do it for cynical reasons. But at least some people do it sincerely thinking it’s the right choice. It’d be more interesting to talk about them, and how they came to make the wrong decision for what they thought were the right reasons.
I think it's a bit presumptuous to say that the decision is "wrong". That paper demonstrated that a single server can outperform a small server farm on a toy problem. Nobody, not even google, solves pagerank in production as a batch job. Real problems are often more complex.
For many workloads, companies have tech leads and CTOs absolutely over-architecting their stack. If you can run your entire system off of 4 load-balanced $200/mo servers, why am I seeing hadoop/kafka/kubernetes/etc running at $10k+/mo pre-money, i.e. paid with investor money? Sure, there are a lot of cases where this is fine, but I would say (from real life) that there are far more cases where this was a pretty poor choice. It usually proves to be even poorer when the tech lead leaves and no one seemingly has a clue how it all works together, despite the whole docker/kubernetes/CI stories peddled to management. That is usually when I get asked to take a look.
In my experience these problems are definitely not always toy problems, while some of them can easily be run on one server for the expected lifespan of the company, because the company will never get that many users/clients/data even though it is immensely profitable.
Not everyone (almost no one) is a FAANG, and I find it offensive to use company money (which can be investor money) to realise the wet dreams of the CTO when the case for that architecture, and its cost, makes zero sense for the business and the bottom line.
Obviously in life things are nuanced and differ case by case, but nah, real-world problems are generally not more complex. They are more complex in some (rare) cases, like the one you named. But most companies are not doing anything like that, yet they do have their CTOs gearing up for that incredibly unlikely future.
I've talked to many people like you mentioned. A large percentage repeat these points:
1) "We always wait for the database, so the performance of our code does not matter." This comes from places where, ironically, the whole database could fit into RAM. 10+ Gbps connectivity is almost a commodity for business now, so latency is not much of a bottleneck. Fast I/O to store data? Well, imagine an array of Optane drives - not very cheap, but really peanuts for a normal business. All this means that a purpose-built data server, which is the heart of many types of business, residing on a decent computer can be blindingly fast, and most businesses will never ever outgrow it. I've written quite a few such servers serving loads of businesses in NA, so hopefully my experience is not irrelevant.
2) "Scripting languages are so convenient to use and save so much time, and that's what matters, since salary is the main expense." Personally I use those mostly for management/deployment/etc. type scripts, and maybe to quickly test some small ideas. I clearly see their benefits there. Anything that resembles a product that would actually run a business? Sorry, but an experienced developer can implement those just as fast as in any scripting language, and it will save a ton on maintenance.
3) And finally, the "do not do premature optimization" mantra. In my opinion, switching from a scripting language to something like C++ has nothing to do with premature optimization. A developer can be very productive with compiled languages, as they also have megatons of libraries for any imaginable task.
> Sorry, but an experienced developer can implement those just as fast as in any scripting language, and it will save a ton on maintenance.
This is just a no true Scotsman argument. For 10 years I’ve watched python/ruby shops drastically outpace projects in C++/Java shops. What you’re failing to realize is how trivial most apps are and how fast fully functional back ends can be created with frameworks in those languages (django/rails).
Your bit about maintenance is also bullshit. I’ve spent entire days peeling apart complex C++ code bases to make a small change to some core abstraction. Ability to maintain code is entirely up to how well organized and documented it is. It has nothing to do with “scripting”.
"What you’re failing to realize is how trivial most apps are and how fast fully functional back ends can be created with frameworks in those languages (django/rails)."
Sorry, but I have the exact opposite experience. First of all, I would not call line-of-business applications incredibly simple; they're actually quite complex business-rule-wise. I saw floors filled with web developers constantly writing/rewriting an endless stream of scripts, often without any meaningful attempt to organize the code. In one example I was writing Python scripts (yes, guilty, it is good for this type of task) to process literally thousands of files of their source code to find and properly replace database access methods, of which a single app had no fewer than 5.
> Ability to maintain code is entirely up to how well organized and documented it is. It has nothing to do with “scripting”
> What you’re failing to realize is how trivial most apps are and how fast fully functional back ends can be created with frameworks in those languages
A vast majority of web dev is incredibly simple. People tend to easily convince themselves that their app is very special and requires a lot of complex solutions to make it work. There’s a lot of incentives that lead people to that conclusion, but most of the time following the golden path laid out for you by an existing framework is going to produce a better product with much less effort.
I run a saas platform and from the outset, I knew that we wouldn't need to do anything special or groundbreaking. We are just sort of line of business. But what I found is that the majority of the complexity started to show up when we are scaling and needed to be able to get results to the user very quickly. That's when you start to deal with queues and pubsub and architecting things to run in parallel. We have a process that only takes about 10 seconds to run completely. Which is acceptable lag time for us. But now, 20 people are simultaneously making that request and the person at the end of the line has to wait 200 seconds, which is not acceptable. This only happens occasionally, so adding servers would be a big waste. That's where the more complicated cloud setups start to help.
It's more like: "Here at BigCo, we do X because it scales. We have learned it from various incidents that we won't tell you about, but trust us, we have to do it this way."
I've worked at BigCo. It's resume padding with the fear of looking like an idiot for not knowing about the new tech.
We were going to go all in on Snowflake with everyone on the team being for it. I sat down, read the original whitepaper, wrote a simulation of what the costs would look like with the current read/write statistics and tested a small batch of data on it to double check.
It turned out we would have paid between 500x and 10,000x what we were currently paying for using Postgres. I moved on three months ago, and last I heard they were trying to use Snowflake again.
I'm starting to wonder if there is market to do "technology laundering", use things like PostgreSQL, SQLite, standard Unix tools, put it under some cloud marketing and charge a x10 premium.
Or perhaps not only there is market, but that's more or less what everyone is already doing.
If you can set it up so it solves my problems, yes.
I could manage my own server and fine-tune everything, or I can throw it on Snowflake. Snowflake means I've spent almost no time managing anything, and it was costing less than an AWS box running Postgres but absolutely blew it out of the water performance-wise. It depends on your workload, but it's been perfect for one of my use cases. If they were just using Postgres under the hood and I got the same experience - fine.
Big things are time sharing resources and management/updates/etc. Can you charge me 10x the underlying cost but let me pay for two minutes on a wildly powerful machine? Great, that's a net win for me.
I know nothing about Snowflake, so I can't really comment on this particular case. However, the generic statement I hear often goes something like this: it is so much trouble to manage your own infrastructure, and on the cloud everything is done for you. Well, I saw with my own eyes that having one's infrastructure deployed on Azure keeps people quite busy anyway.
After a year of AWS I can tell you it's as much a PITA or more to manage all the components as it would have been to manage a few Spring Boot and MySQL Droplets.
Not really OLTP, analytical workloads but can't go into much detail I'm afraid. Infrequent, unpredictable and benefitting from rapid scaling (0 to lots of power & ram for short periods) is where cloud type things (can) really shine.
This was part of the excitement/promise of Go: to eliminate the need for small-scale MapReduce jobs, with library code that launched a year or two before the COST paper was published.
As someone who has implemented a complex system in C++ in this decade, I’d say he’s not wrong, but you need to carefully weigh the pros and cons.
In our case, latency and real-time demands mattered a lot (a NASDAQ feed parser), to the point that the (potential) slowdown of a garbage collector kicking in was enough to rule out Java and .NET. It runs entirely in memory and on 64+ cores.
We implemented our own reference counting system to keep some of our sanity, and to at least try to avoid the worst memory leaks.
This was an edge case, and for almost everything else you’re probably better off implementing it in something that handles memory for you. If performance is an issue, at least try it in Go or Rust with a quick PoC before jumping on the C++ wagon.
Memory management in modern C++ is considerably easier than it used to be. Bespoke memory managers aren't really needed, you can do almost anything you need to without ever using new or delete.
If you are writing C++, in any field, just use std::shared_ptr and std::unique_ptr from the standard library, along with std::make_shared and std::make_unique.
C++ shared_ptr unfortunately is artificially slowed down by multithreaded synchronizations. The count in the control block is always updated atomically. But I frequently need to use them not because my data is shared by multiple threads, but because the ownership situation isn't static. Consider for example implementing a single-threaded persistent tree.
This means people frequently need to reinvent their own reference counting mechanisms.
To be fair, std::shared_ptr is just a high-level component that's a part of the STL. Although it's defined in the C++ standard, it's just a generic shared pointer implementation designed to be as robust and bullet-proof as possible.
As with everything in the C++ STL, if you care about bleeding-edge performance then you should be prepared to use more performant (and less generic) components, whether it's data structure implementations or shared pointers.
This is the quickest way to kill your performance in C++. I guarantee it wasn’t what Carmack was talking about.
I used shared_ptr extensively in a game engine. Whoops: suddenly 20% of the frame time was gone, never to be recovered. Once that performance is gone it’s almost impossible to get it back, short of rewriting every system.
Only use shared_ptr when you actually want shared ownership. If you stick to unique_ptr for owning references and then "borrow" that raw pointer in functions, then you get speed without sacrificing too much safety and still no new/delete.
Additionally, shared ownership should be rare. Most objects should have a single owner that is responsible for them. This is not only better for performance, but also makes the system easier to understand.
std::shared_ptr uses CAS atomics, which in heavy multithreaded code (multiple threads operating on the same pointers), can have surprising overhead in some situations.
On the other hand, not using atomics in heavy multithreaded code is just a recipe for disaster. The problem with shared_ptr is the fact that it uses atomics even in single-threaded code, which is obviously an overkill.
> The problem with shared_ptr is the fact that it uses atomics even in single-threaded code, which is obviously an overkill.
On the other hand imagine the security issues if shared_ptr was not thread-safe - you could just not reliably destroy a shared_ptr in any thread.
If you really know that a graph of shared_ptrs will never move off a single thread, you can use boost::local_shared_ptr explicitly.
One of the advantages of rust is its ability to segregate safe-to-share (between concurrent contexts) and not so, at compile time.
While it can’t (yet?) be generic over them, it lets you use the non-atomic Rc in thread-bound structures and know this is never going to be shared between threads, whereas the more expensive Arc can have handles be moved from one thread to an other.
If you're parsing market data then you shouldn't really be allocating at all once the initial setup is complete. So there shouldn't be a need for ref counts.
This is because the allocator itself can have an unbounded runtime that takes milliseconds, causing you to drop.
In the past I've replaced malloc with an implementation that asserts if called on certain threads after init time.
Most likely it is simply invoking a syscall (sbrk or mmap), which invokes the scheduler and may yield the time slice. Feed parsing code may not issue any other syscalls (entirely userspace), turning the allocation into a bit of russian roulette.
Except compaction can also take many milliseconds and come from different threads.
Writing a trading system in Java is harder than C++ imo because where before you had an allocation problem, now you have a multithreaded randomly stalling allocation problem.
Virtu did it but everything I’ve heard about it nullifies the benefits of using java in the first place.
Newer Java GCs are very low latency (microsecond). You trade performance and memory for that low latency though. AFAIK, they are still compacting.
Still though, probably makes sense to do it in a lower level language. It's just far easier in C++ to decide that "Hey, you know what, I just want a big memory block that I control".
I've even heard of game devs doing things like having per frame allocators. They get super fast allocation because they just pointer bump and at the end they simply reset the pointer back to location 0. I'm sure trading systems could do something similar.
The point is that a one millisecond pause is unacceptable. Low latency Java GCs have average latencies of one millisecond, 99th percentile latencies of 10 milliseconds, and 99.9th percentile latencies are neither measured nor optimized for.
I don't consider it realistic to think that garbage collected languages might ever be usable in the context of game engines or HFT.
Game engines are already written in GCed languages. Java in particular.
You may be right that a 10ms pause is unacceptable for HFT. However, for a FPS, 10ms is more than acceptable. It translates to 1 or 2 missed frames in the worst case.
A bigger issue with using Java in particular for games is its lack of value types. Writing high-performance code in Java is just that much harder because the language gets in the way.
The only serious engine I know of that's written in Java is Minecraft and its performance problems are notorious.
Microsoft XNA and Unity's .net support were also pretty popular but the popular games written in those languages (like Bastion) didn't have many heavyweight assets or allocator pressure.
Having written lot of Java, Scala and C++ (and recently some Rust), I must say it is much easier to avoid heap allocations in C++ and Rust than in GCed languages, thanks to explicit allocation on the stack and pass by value + move semantics.
A big push in .net core 2.x and 3.x was what we call the Span-ification of the base class library and the runtime. This means there are many new APIs for dealing with slices of memory in a non-allocating manner, and this combined with memory pooling has contributed greatly to an overall performance boost to the runtime by reducing copying and GC time. These same APIs are available to the developer so I'd imagine that it would be simple to build a non allocating network buffer reader. I have built a non-allocating video renderer before using the new APIs, for example.
There exist libraries for native memory pooling in Java as well, and we're using them. I'm not saying low allocation code can't be done in C# or Java. But these languages don't give some nice tools that are present in C++ and Rust - in particular RAII and automatic reference counting.
> I'm not saying low allocation code can't be done in C# or Java.
With modern tools and APIs it should even be possible to write alloc-free work loops in C# (though you're probably writing pretty alien C# at this point). AFAIK it'll remain impossible in Java until the "value types" effort bears fruits.
A few months back a study of sorts made the rounds implementing a network driver in multiple languages (Ixy). C# did extremely well in it (better throughput than Go and nearly competitive with C and Rust at higher batch sizes, though it was way behind on latency), while Java was pretty much in the dumps.
In one of the reaction threads (don't remember if it was on HN or Reddit) one of the people involved explained the discrepancy between Java and C# by not being able to go under ~20 bytes of allocation per packet forwarded in Java.
There were other odd / interesting results from the effort e.g. Rust was slightly slower than C, in investigating that they found out Rust executed way more instructions (especially significantly more stores) but had significantly higher IPC and much higher cache hit rates.
A trading system in C# or Java is not, practically by definition, state-of-the-art. To suggest otherwise just demonstrates not knowing the State of the Art.
But aren't the edge cases the things that justify your pay? The standard cases of today are the introductory examples of tomorrow and will be automated or abstracted away by the end of next week. In JavaScript.
I'm 15 years into writing high-performance Internet servers in C++, and I can confirm that higher-level languages provide an illusion of capability; but once you're talking high performance with high compute requirements and scaling your service, the cost efficiency of C++ is exponentially better than any other language's. The higher-level language ecosystems are bloated beyond repair.

I was able to use one 32-core physical server running a C++ HTTP server I wrote, providing a rich media web service, to replace an AWS server stack that cost my client 120K per month. The client purchased one $8K 32-core server and co-located it behind a firewall at a cost of $125 per month. And the C++ server ran at 30% utilization, plenty of room for user growth. Their AWS stack of a dozen C#, Python, PHP and Node apps was peaking its capacity too.

Of course, my solution caused existential questioning by the non-geek CEO and the CTO, but they were in crisis and needed to radically revise how they provided their service or close.
The thing that worries me about stories like this is that there is frequently (as is the case here) no mention of any sort of HA or backups. No details on what disaster recovery looks like. Those are business-critical considerations that cost money, and they just disappear from the discussion when people say "hey, I saved all this money dropping everything down to a single server!"
Well, in the case described above, my single-server solutions include an automated backup sub-system, and my servers expect multiple instances of themselves to be running on the client network; these instances synchronize with one another via additional endpoints specific to that purpose. The whole issue of HA and backups is critical and one of the areas where my approach shines.
You're still not actually answering the question of how you are HA and backing things up with only a single physical server and nothing else.
If you're backing things up to the same server, that's not enough.
If the HA instances are running in the same server, that's not enough.
If there are other things besides that one physical server and its power/network, you didn't include them in the cost, so the comparison is disingenuous there.
I do say the deployed system ends up being multiple instances of my single server, which synchronize with one another. Those are separate physical devices each running my one server. Additionally, when a backup runs the data is stored locally as well as on a physical storage device separate from the hardware it is running. Typically clients already have a firewall/router which is used to distribute requests to the various instances. My deployed systems are not one server, they become a server mesh.
> Those are separate physical devices each running my one server. Additionally, when a backup runs the data is stored locally as well as on a physical storage device separate from the hardware it is running. Typically clients already have a firewall/router which is used to distribute requests to the various instances.
Awesome! Really glad to hear this is the case. But those are all added costs beyond the single physical server you gave the price of.
Server cost * number of physical servers you have deployed
Cost of your off-server storage
Cost of your network appliance doing load balancing
It's still probably way less than the AWS bill, but it's not really fair to compare the total price of infrastructure in one environment vs. just a portion of the other.
I am in the same boat (writing native servers). I will also "disappear" if you start asking me about HA/backups/etc. Particular solutions are very much case-specific and can depend on business rules just as much as on pure tech factors. Properly answering your question requires way too much writing and is hardly the subject of a single post. I have HA solutions for the products I've built, but this post is the extent I am willing to talk about the subject ;)
I'm not asking about the specifics, and I don't really care about them.
But the fact of the matter is quite simply that any single-physical-server solution will never be satisfactory for backup or HA purposes.
You can't store your backups in the same place as your data and call it good - what do you do when you have multiple disks fail and your RAID can't be rebuilt? This happens. What do you do when operator error accidentally destroys the array? This happens. What do you do when there's a datacenter fire and the server burns up? This happens. None of these things should be a business ending event, but if you only have a single server handling quintuple duty, that's what it has a real chance of being.
If you need HA, a single server isn't good enough even if you've got multiple VMs running the service. What do you do when the utility power is out and the generator fails to kick on properly or they run out of fuel? Both of those happen pretty frequently. What do you do when there's a network outage at the DC? This happens. When someone fucks up BGP somewhere and now the prefix you used is being routed to god knows where? This happens. When you have any sort of physical server failure that brings your single box down? This happens. Any of those situations will take you offline and render your HA meaningless.
I'm not saying that any of these things are impossible to do when colocating your hardware - but they're not free, and they're not mentioned even at a high level in this story. And since we're not talking about running benchmarks on price/performance and instead talking about a service that a business needs to keep available to their customers to make money, these are important aspects to talk about. Everywhere I've worked, keeping our services available and being able to recover from hardware failure are far more important priorities than being able to optimize performance. And doing those things properly takes more than one physical server.
I can't speak for FpUser, but you misunderstand the idea of creating a single server: one does not just run one of them, they know about and expect multiple copies of themselves to be running at different IP addresses, and they synchronize with one another, as well as maintain individual backups that additional background processes validate between different instances.
Each and every one of your disaster scenarios is handled by the architecture. Each and every one of your disaster scenarios has happened, and we've lived through them, as well as reviewed and optimized, after the fact, how we handled the events. As you are, we're professionals.
Currently using Restbed (https://github.com/Corvusoft/restbed) as the server core, wxWidgets as a server-side GUI, with Boost, Curl, SQLite and the Standard Library. It's not that complex, beyond using lambdas in a few places. It has extremely high performance and can run on an Intel Compute Stick, but I tend to use an Intel NUC at minimum, with clients typically using whatever hardware they have, gaining redundancy and the ability to pare down. Memory management in a C++ application is just another resource one manages, with whatever level of algorithm support you feel comfortable with. There are ref-counting systems and complete garbage collectors available that one can integrate into their business logic, unlike in a high-level language that "transparently" manages memory outside application control.
Have you ever thought about how much processing a typical 3D video game performs every frame? What if that caliber of optimized algorithm logic were handling a rich media, non-3D game server hosted business application? It would have pretty amazing performance and scale very economically. That's what I do. Before doing this, I wrote 3D video games and their production environments.
Horizontal scalability carries a lot of overhead. Probably a factor of 10, easily. But the clue is in the name: eventually you'll get to a point where you have to scale.
Back in 2010 I worked for a company whose system, in Java, ran on a single web server (with one identical machine for failover). We laughed at our nearest rivals, who were using Ruby, and apparently needed 60(!) machines to run their system, which had about 5x the average request latency of ours.
Then traffic doubled, and suddenly we were having to buy six-figure servers and carefully performance-optimize our code, and our rivals with the 120 Ruby servers didn't look so funny any more. And then traffic doubled again.
> But the clue is in the name: eventually you'll get to a point where you have to scale.
Isn't that the conceit underlying this whole argument, though? Many systems won't ever get to the point where you have to scale in that way, if you build them efficiently in the first place.
Perhaps more importantly, for many applications, you'll be able to see the limit coming some way ahead, and if you've reached a size where you do need a fundamental restructuring in order to start scaling horizontally, that's going to be a nice problem to have and you'll also have the resources to do it.
I know several online systems that are handling significant traffic volumes perfectly well on a simple, single-server basis. They don't get bogged down in infrastructure and tooling issues, ever. They don't get confused by complicated cloud hosting issues, ever. They are free to spend almost their entire development budget on actually developing useful functionality, which is like a breath of fresh air in today's dev culture.
Obviously in the more serious cases they probably also have some redundancy for backup/failover purposes, but even that is simple and, if necessary, can probably be handled manually when you only have a handful of servers to manage. Here I do slightly disagree with one of Carmack's later tweets: I would argue that going from 1000 servers to 100 is just accounting, but going from 100 down to 10 or fewer is more of a qualitative change (albeit not exactly the same qualitative change as going from 10 down to 1).
> You must have had significant in-memory shared state to encounter that problem. Right?
"Significant" is in the eye of the beholder. The core of the system was easy to make shardable. But you'd be surprised how many implicit assumptions creep in, how easy it is for ancillary parts to end up sharing state when it's easy. Also note that just because your state's in a database doesn't mean having two instances of the thing that accesses it will work, you can easily end up with an access pattern that assumes only one reader (for example) even though on paper there's no in memory state.
> Had you adopted a less stateful model, you'd have looked rather pretty with two Java servers.
Maybe. At the point where we're running 4 or 8 servers we'd have been facing much the same ops problems that they were. Java bought us an extra year or two of not having to deal with that, but also a significant amount of migration work when it just became impossible to stick to the single process model. At the end of the day we still kept the 5x latency advantage, which is definitely not nothing. But there were also definitely features that they brought to market quicker, and I'm pretty sure Ruby played a part in that.
Tradeoffs, tradeoffs everywhere. I left before the final outcome of that fight (for all I know it's still ongoing), but I don't think either company was being dumb.
> At the point where we're running 4 or 8 servers we'd have been facing much the same ops problems that they were.
Yes and no.
You'd need some ops work, but you'd need to worry a lot less about managing your infrastructure provisioning to keep costs low, e.g. reserved instances, dynamic scaling, etc., and about putting out fires when you inevitably exceed your tight perf margins.
You could overprovision 24/7 by 50% and write it off. Your competitors couldn't.
> Web servers are usually trivially horizontally scalable.
Maybe he really meant to say 'application server'. I.e that there was server side code with non trivial compute/memory requirements for running business logic.
Of course it's possible to write a horizontally scaled application in Java or C++. But once you have to deal with horizontal scaling anyway, language performance is much less of an advantage: as Carmack says, the difference between 100 servers and 10 is just accounting.
And the difference between 1000 servers and 100 is not just accounting. In essence, 1 full rack is easy to reason about while N>1 is not, for myriad non-accounting reasons.
It should be noted that report is talking about the client devices too - desktops, laptops, and even TVs. The datacenter part only accounted for 15% of that.
Sometimes in a specialized team, the difference between 100 servers and 10 is profitability. Don't want to get laid off because that cloud bill is $100k a month.
In my experience it's the individual contributors who are overly obsessed with being elegant and efficient in their use of machine time. Those who are conscious of the bigger picture tend to have a more accurate sense of the relative costs of machine time versus engineering effort.
Multiplying your datacenter bill by 10 because you couldn't be arsed to spend two days thinking about system architecture isn't an "accurate sense of the costs" in any universe, except the one where you're spending venture capital bucks and waiting to get quickly bought by a Google with more free money than they can count.
Sounds like they had their datastore as part of the application. So the only way to scale it would be to write a distributed data store... or rewrite the app :p
Yes, I’m always shocked by just how much performance overhead most languages have compared to C and similar lower level languages. It is a price worth paying for better language ergonomics, but I do wonder whether Rust might be able to give us the best of both worlds here.
I semi-seriously think the entire modern shape of the cloud is a result of Ruby being really slow.
Back when people were writing their backend business apps in C++, COBOL, Java, etc, if there was ever a performance problem, you could usually just get a slightly bigger machine and grow your thread pools a bit. But once the web took off and Ruby exploded onto it, you couldn't do that, because it's an order of magnitude slower, and doesn't really do multithreading. But, as long as you follow twelve-factor discipline, it scales horizontally like a champ. So, we took to horizontal scaling over multiple VMs (and caching things in Redis instead of local memory or Hazelcast or whatever), and that's been the unquestioned way to do scaling ever since.
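A rough sketch of that pattern, assuming a Redis server on localhost and the redis-py client (render_profile is a made-up placeholder): because the counter and the cache live outside the process, any number of identical instances can serve any request.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def render_profile(user_id):
        return f"<profile page for user {user_id}>"   # stand-in for real page assembly

    def handle_request(user_id):
        views = r.incr(f"views:{user_id}")        # shared counter, not a local variable
        cached = r.get(f"profile:{user_id}")      # shared cache, not a per-process dict
        if cached is None:
            cached = render_profile(user_id)
            r.setex(f"profile:{user_id}", 300, cached)   # expire after 5 minutes
        return cached, views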
The push for scaling out started with Ruby's and Python's lack of performance. The reason being pushed at the time was that "developer time is more expensive than hardware." Well, that didn't count the amortization of developer time over the lifetime of the product once the product was developed.
It's mostly a fallacy that a product in demand ever becomes "developed". Maybe a game that gains cult status, and therefore a long tail of life, is the exception. But popular web services are in constant churn, and in that space it's valid to trade hardware for programmer productivity.
In my experience this is only true iff the system never gets any more user facing features.
Every new non-insignificant feature requires a new REST-API route, or database table or modification of the GraphQL schema. And depending on how you designed the backend, even small redesigns of the frontend might require changes on the backend. Consider a simple app showing car rentals, where you initially have something like /car/[id], then the frontend guys realise that we need to show cars rented by each customer and it's necessary to have /customer/[id]/rentals.
(Of course it's possible to design a system without any schemas or normalisations, which would make your statement truer, but that's rarely attractive for other reasons.)
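A hypothetical Flask-style sketch of that change (route names and helper functions are invented for illustration): the frontend redesign forces a new endpoint on the backend even though nothing conceptually new was built.

    from flask import Flask, jsonify

    app = Flask(__name__)

    def find_car(car_id):
        return {"id": car_id, "model": "placeholder"}          # stand-in for a DB query

    def find_rentals_by_customer(customer_id):
        return [{"car_id": 1, "customer_id": customer_id}]     # stand-in for a DB query

    @app.route("/car/<int:car_id>")
    def get_car(car_id):
        return jsonify(find_car(car_id))

    # The endpoint the frontend redesign suddenly requires:
    @app.route("/customer/<int:customer_id>/rentals")
    def get_customer_rentals(customer_id):
        return jsonify(find_rentals_by_customer(customer_id))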
But all those are just extensions using the established frameworks and development practices for that backend application. Once those are developed, it's pretty easy to add new features. I would argue that using scripting languages like Ruby and Python brings no significant benefit in development time. In fact they make it worse when working on an existing app, with the extra unit tests, difficulty in refactoring, and extra performance work.
> Well, that didn't count the amortization of developer time over the lifetime of the product once the product was developed.
I'll bite. I think it actually counts not only that, but also the probability that the product will actually get to be developed, and not need to significantly pivot, therefore throwing all developer time spent on performance in the garbage.
If hardware speeds and capacities had continued to increase as rapidly as they used to, there might have been some truth in that. Buying a computer with a faster CPU or a bigger disk or more RAM was a relatively cheap solution to a lot of performance problems for a long time, in no small part because it typically required zero changes to the software itself. However, once you're talking about qualitatively different hardware architecture (scaling out instead of up) and therefore also qualitatively different software architecture, it's far from obvious that the premise still holds or that we should even expect it to.
I don't necessarily think you're right here, but I do know the number of horror stories that have come out of Heroku over the years certainly validates the opinion.
I remember reading about their routing debacle and realizing just how much work went into trying to get ruby to scale.
Your points are all valid, but I don't think it's the "unquestioned" way. It simply happens to be a great way to scale, and also to isolate complexity etc. You can scale to monster loads this way, in a way that in-process caches can't.
The idea that GCed languages in general have Python-like performance is a dangerous myth. Languages that are managed but not interpreted (e.g. Java, OCaml, Haskell, C#, Swift) have performance characteristics that are much closer to C than to Python.
While I agree they can sometimes compete in numerically heavy tasks, the place where they fall behind is memory management. Systems like databases need very careful memory management and GC is not always your friend there. I'm still hoping some day we'll be allowed both GC and Rust-like manual memory management in a single language, although I'm not sure it is at all possible.
> the place where they fall behind is memory management
It's sometimes possible to design C# code in a way that doesn't stress GC too much, so the majority of bandwidth bypasses the GC.
For instance, here's my old .NET project which plays streaming media for hours without interrupts, with a hard limit of 15MB RAM for the whole process: https://github.com/Const-me/SkyFM
Doing so became easier in modern .NET, with these value tuples and spans.
> still hoping some day we'll be allowed both GC and Rust-like manual memory management in a single language
Microsoft tried: Managed C++, then C++/CLI. The CLI is still supported if you want to try, but IMO they both were way too complex: 2 types of pointers, two runtimes with weird interaction between them, and the worst of both sides on safety and ease of use.
AFAIK most people only used these languages for a thin layer of glue to integrate native C++ with C#. And even for that limited use case, COM interop or C interop often worked better. Even MS switched to COM interop in the next iteration, C++/CX.
Actually, Python is compiled to its bytecode and then interpreted by the Python VM, which is also how Java works. Python is slow because of the lack of funds and focus; just take a look at how fast JavaScript (V8) is nowadays.
No, Python bytecode is still interpreted at runtime. Java, JS and .NET are first compiled into bytecode, but then they are also JIT-compiled into machine code, which Python is not.
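You can see the interpreted bytecode directly with the standard-library dis module (exact opcode names vary by CPython version):

    import dis

    def add(a, b):
        return a + b

    dis.dis(add)
    # Prints something like LOAD_FAST a / LOAD_FAST b / BINARY_ADD / RETURN_VALUE;
    # CPython's interpreter loop executes these one at a time, with no JIT step.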
You can JIT Python also using https://www.pypy.org/ but it's not the default and it's not 100% official and compatible.
It's also nowhere near as fast as other language JITs.
I'm talking orders of magnitude slower than JVM, .NET and LuaJIT. End result being a - relatively - dead project, since the incentives for people to use it over stock Python and pay extra compatibility costs are not there.
I think you entirely ignored the performance cost of dynamism (not having types known, not having value types, dynamic binding of methods etc..) that is handy 1% of the time but imposes type safety & performance cost other 99% of the time..
> It is a price worth paying for better language ergonomics
Back around 2001 or so, I was hired at an online travel agency to assist in porting their entire system from C++ to Java. Management at the time was frustrated at how long it took to add new features in the C++ codebase - it could take months to get something working in some cases, and they blamed the programming language. Once we got everything working in Java, we found that we could, in fact, turn around feature requests much faster than they could in C++. The trade-off was performance problems: they (we) found that in the old C++ codebase, if you fucked something up, the whole thing crashed, and you had to fix it before you could get it to run. In Java, programmers could paper over their fuck-ups pretty easily so that they wouldn’t be noticed until they had created a snowball effect that caused everything to slow down. Since management was pushing for more features faster, they were incentivizing developers to do as little testing as possible and kick the can down the road. For the most part, management was OK with this: they just bought more, and more, and more servers to make up for the performance problems we were having from the quick-turnaround features they wanted.
He did say that Java/C# are also up there, and Go is in that family, so it probably remains the best balance for lots of cases. I do think there's also territory to be explored writing hot paths in Rust and interoping from Python/JS.
not how i read his tweet. he was lamenting that python is too slow for some server side development use cases. and he gave cpp as an alternative that would be simpler and faster. he even followed up citing java and csharp.
totally agree. if all your backend server is mostly complex serialization & de-serialization, and pushing bytes to other sub-systems, i think many other languages have advantages over python.
IMO, the linked article is much more insightful than the flippant comment here might suggest. It's a solid argument, backed by real world data, about how easy it is to make bad assumptions equating better scalability with better performance.
Carmack is moving to AI and inevitably has to deal with a lot of Python, which still bottlenecks things here and there despite all the effort to move computation to C extensions. I really hope he detours a bit and creates very, very good non-Python tooling for ML.
Python has some fundamental language semantics that make it really hard, if not impossible, to create an implementation that can match Java. Pypy is probably the best you can do to optimize python, and it shines in tight numerical loops, but it gets less effective as code gets more complex.
I've been programming C++ and assembly for 23 years. Few years ago I became a huge fan of Python. In my opinion Python is amazingly well suited for rapid first revision and can then be swapped out for C++ / asm.
This is fine as long as you can convince management to spend the money to rewrite your software. That's usually a hard sell though. In my experience this plan usually ends up with a python monstrosity that everyone hates but is forced to deal with forever.
You just have to write a tiny part that uses a lot of CPU in C++/asm or anything else.
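As a toy illustration of that split, using numpy as the "compiled tiny part" (the same idea applies to a hand-written C extension or asm kernel):

    import numpy as np

    def score_pure_python(values):
        return sum(v * v for v in values)        # hot loop runs in the interpreter

    def score_native(values):
        arr = np.asarray(values, dtype=np.float64)
        return float(np.dot(arr, arr))           # same loop pushed into native code

    # The request parsing, logging, and other glue around this stays in Python;
    # only the inner loop needed to move.
    print(score_pure_python(range(1000)), score_native(range(1000)))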
Much of the code's performance isn't really reflected in scalability, since usually only a tiny part of the code runs most of the time; the other parts are just glue, management stuff, or rarely used (not used at scale) features.
It depends. You aren't going to make a very fast modern codec encoder or decoder using Python. The hotspot ends up being the vast majority of the process. That management/glue layer becomes very thin, amounting only to feeding in the bitstream and reading back the raw video frames.
My good friend built a whole career doing exactly the same: "Replace a cluster of 10 Elasticsearch servers with 1 running a custom-built C app and an in-memory database".
Of course, it won't work out to replace 1000 Elasticsearch servers - that's where the advantage of a true "big data" tool will show - but none of the clients really have data "that big".
That's the reason why you need to go multi-process if you want to reach a similar level of concurrency in Python as multi-thread in C++. And that surely adds a lot of complexity.
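A minimal sketch of that multi-process route for CPU-bound work, using the standard library's multiprocessing.Pool:

    from multiprocessing import Pool

    def chunk_sum(chunk):
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(8_000_000))
        n = 4
        step = len(data) // n
        chunks = [data[i * step:(i + 1) * step] for i in range(n)]
        with Pool(processes=n) as pool:
            total = sum(pool.map(chunk_sum, chunks))
        print(total)
        # Each worker has its own interpreter and its own GIL, so the chunks really
        # run in parallel, at the cost of pickling the data across process boundaries.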
As a very practical example of this, TensorFlow has a dedicated page with advice on how to make the Python part that reads the files from disk less slow. Think about that: The bottleneck for training a highly advanced AI with millions of parameters is in the 20 lines of Python code to read in a binary file...
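The usual fix is generic and small: overlap reading the next batch with computing on the current one, via a background thread and a bounded queue. This is the idea behind tf.data's prefetch; the names below are made up for the sketch.

    import queue
    import threading

    def prefetching_batches(read_batch, num_batches, depth=2):
        q = queue.Queue(maxsize=depth)

        def producer():
            for i in range(num_batches):
                q.put(read_batch(i))     # blocking disk/network read happens here,
            q.put(None)                  # overlapped with the consumer's compute

        threading.Thread(target=producer, daemon=True).start()
        while True:
            batch = q.get()
            if batch is None:
                return
            yield batch

    # Usage sketch (load_file and train_step are hypothetical):
    # for batch in prefetching_batches(lambda i: load_file(f"shard-{i}.bin"), 100):
    #     train_step(batch)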
Many other people "know" about the GIL, to the extent of believing there's no point using threads in python "because of the GIL".
I had a funny such experience lately in a job interview. I told the interviewer his misconception could be falsified with ~10 LOC summing a list with 2 threads.
How did you ensure those threads would actually execute concurrently?
If any operations were GIL-bound (which is extremely common, even if not intentional, by reliance on bytecode instructions dealing with CPython API under the hood, like attribute lookups or iteration special methods), then execution is probably interleaved serially and constrained by the interpreter’s GIL allocation, and slower than just summing serially.
I’ve seen a lot of people who are cocksure they know some contrarian “actually you can use threads” trivia about Python and just naively use the threading module or naively use the newish ThreadPoolExecutor stuff not realizing that no, in fact, it’s not somehow magically always GIL-avoiding to do so.
> How did you ensure those threads would actually execute concurrently?
By having multiple concurrent IO requests from multiple threads? Can probably log detailed timestamps to see if IO requests were happening in parallel.
You're right of course. I assume the OP used a trick by going via IO somehow. Though granted, it might require having a forking server to do the actual summing or using another language, which would be more than 10 lines of code
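Something along those lines, as a sketch: time.sleep stands in for a blocking IO call (like a real socket read, it releases the GIL while waiting), and the logged timestamps show the waits overlapping instead of adding up.

    import threading
    import time

    def fake_io(i):
        start = time.time()
        time.sleep(1)                              # stand-in for a network call
        print(f"request {i}: {start:.2f} -> {time.time():.2f}")

    t0 = time.time()
    threads = [threading.Thread(target=fake_io, args=(i,)) for i in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"5 one-second 'requests' took {time.time() - t0:.2f}s")   # ~1s, not ~5s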
Ok, I see some comments (rightfully) asking for less talk and more code.
    # main.py
    import random
    from concurrent.futures import ThreadPoolExecutor as Pool

    items = [random.random() for _ in range(10 ** 7)]

    def run(items, n):
        step = len(items) // n
        with Pool(max_workers=n) as ex:
            res = [ex.submit(sum, items[i*step : (i+1)*step]) for i in range(n)]
        return sum(r.result() for r in res)

    if __name__ == '__main__':
        import timeit
        import sys
        n = sys.argv[1] if len(sys.argv) > 1 else 1
        time = timeit.timeit('run(items, %s)' % n, 'from __main__ import run, items', number=10)
        print("%s\t%.3f" % (n, time / 10))
$ for x in `seq 1 16` ; do python3 -m main $x ; done
1 0.172
2 0.170
3 0.166
4 0.155
5 0.149
6 0.142
7 0.144
8 0.140
9 0.136
10 0.135
11 0.135
12 0.137
13 0.135
14 0.136
15 0.136
16 0.136
The point of the code is not to speedup the execution of summing a list of random number, but rather to speedup the acknowledgement of N random python developers that they have some misconceptions about the GIL.
I think it does that pretty well but, well, that's just like my opinion.
The major reason people use threads is to speed up compute intensive tasks.
Ideally, pure computation like the one in your example should get a speed-up linear in the number of threads: 2 threads -> 2x faster, 3 threads -> 3x faster, up to the number of cores.
If that's not happening then you have excessive locking.
In Python excessive locking is caused by the GIL - the Global Interpreter Lock. As the name implies, there's a single, global lock, and execution in one thread effectively blocks all other threads because it holds the GIL captive.
What this particular benchmark is showing is that the GIL is as bad as everyone says: instead of getting a 16x speedup, you get speedups that are almost within the margin of error for such a coarse measure.
I feel like you're beating a strawman. Using threads to speed up IO in Python is common, and for computation heavy work the fear of the GIL is 100% justified, as your example shows.
I'm a casual Python programmer (just bit of scripting) and I had no particular misconception about the GIL; I don't care because I write single-threaded cookie cutter scripts.
Your example I think is demonstrating the opposite of what you want to show. Those figures are atrocious.
What were the misconceptions of your interviewer you were trying to prove wrong with your POC of multiple threads summing a list?
The way I read your post, it seemed like your interviewer told you that threads in Python are not effective for parallel computation because of the GIL, and your example proves exactly that. The performance of your threads is absolutely horrible; if you were to do that in C++/Java/Go you would likely see a speedup on the order of min(16, cores). Your example proves that your threads are exactly serialized during the computation, which I assume was the point of your interviewer (but please clarify my assumption).
Perhaps your interviewer was instead proposing that threads in Python work like a charm?
1. There is a common misconception that any threaded python code is slower than sequential. This is trivially false for IO-bound, and possibly, potentially, in some cases false for CPU-bound. Said interviewer had that very misconception.
2. More importantly - you and another parent are right. The code does not demonstrate what I thought it does. This is proven by removing the ThreadPool and leaving the rest of the code intact:
    import random
    import math
    from concurrent.futures import ThreadPoolExecutor as Pool

    items = [random.random() for _ in range(math.factorial(11))]

    def run(items, n):
        step = len(items) // n
        res = [sum(items[i*step : (i+1)*step]) for i in range(n)]
        return sum(res)

    if __name__ == '__main__':
        import timeit
        import sys
        n = sys.argv[1] if len(sys.argv) > 1 else 1
        time = timeit.timeit('run(items, %s)' % n, 'from __main__ import run, items', number=10)
        print("%s\t%.3f" % (n, time / 10))
$ for x in `seq 1 11`; do python3 -m main $x ; done
1 0.799
2 0.749
3 0.671
4 0.715
5 0.730
6 0.704
7 0.649
8 0.631
9 0.689
10 0.613
11 0.616
This is a result I'd have to silently contemplate before making any further comments.
I’m pretty sure the time “savings” you are seeing here come from somewhere else. At first glance, you’re copying the huge list while submitting it to the thread pool, and this has overhead. Maybe lots of smaller copies are faster on your machine.
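One way to test that hypothesis is to time only the slicing, with no summing and no threads:

    import random
    import timeit

    items = [random.random() for _ in range(10 ** 7)]

    def just_slice(n):
        step = len(items) // n
        return [items[i * step:(i + 1) * step] for i in range(n)]

    for n in (1, 4, 16):
        t = timeit.timeit(lambda: just_slice(n), number=10) / 10
        print(f"{n}\t{t:.3f}")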
I don't know the specifics of Python's C interfaces, but I know Ruby by default will hold the GIL across all 3rd-party extensions for safety, and you can release it upon entry into the extension using the C API. But doing so can be very dangerous if you're calling back into the Ruby core, as you could inadvertently cause problems that the GIL was meant to protect you from.
It still hits the problems, but it can still be better than single-threaded. That's the point of his comment. People say the GIL is bad so they throw the baby out with the bathwater and no longer use threads, which isn't very well-reasoned.
Been a while since I've used Python, but as far as I remember the GIL only affects Python objects. So if you use Numpy for operations, you can avoid the GIL.
And, more importantly for us, while numpy is doing an array operation, python also releases the GIL. Thus if you tell one thread to do:
print "%s %s %s %s and %s" %( ("spam",) *3 + ("eggs",) + ("spam",) )
A = B + C
print A
During the print operations and the % formatting operation, no other thread can execute. But during the A = B + C, another thread can run - and if you've written your code in a numpy style, much of the calculation will be done in a few array operations like A = B + C. Thus you can actually get a speedup from using multiple threads.
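A hedged way to measure that effect (whether you see a real speed-up depends on the operation and on memory bandwidth, but the GIL itself is released while numpy works on the arrays):

    import threading
    import time
    import numpy as np

    B = np.random.rand(20_000_000)
    C = np.random.rand(20_000_000)

    def work():
        for _ in range(10):
            np.sqrt(B * C)            # numpy releases the GIL inside these operations

    t0 = time.time()
    work()
    work()
    print("sequential:", round(time.time() - t0, 2), "s")

    t0 = time.time()
    threads = [threading.Thread(target=work) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("2 threads:  ", round(time.time() - t0, 2), "s")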
>Threads have been in the standard library for a very long time.
Yes and still run the risk of running into GIL problems. You claim to have shown 10 lines to an interviewer that proved he was wrong about the GIL still being an issue and I've yet to see an example of it not being one when using pure python. Yes there are certain cases where you don't hit the GIL, no that doesn't mean it isn't still an issue when dealing with threads.
I just meant that one way to make threads "work" is to use IO, so I assumed the OP did a trick along those lines. Basically two threads can receive data or read from disk at the same time. The same thing can be accomplished with a select/epoll setup, but threads would just be fewer lines of code.
Otherwise another trick is to identify a call which uses extensions or releases the GIL once it goes into C.
> I had a funny such experience lately in a job interview. I told the interviewer his misconception could be falsified with ~10 LOC summing a list with 2 threads.
I had multiple of those experiences. Even then people wouldn’t believe me so we had to look at Python C code and read about threads and IO operations in C and such. The other alternative is to write a deliberately slow socket server that takes say 5 seconds to respond then then write a threaded client to issue requests and seeing how they complete in parallel.
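A rough sketch of that demo using nothing beyond the standard library: a deliberately slow local server plus a threaded client, where four five-second calls finish in about five seconds rather than twenty.

    import socket
    import socketserver
    import threading
    import time

    class SlowHandler(socketserver.BaseRequestHandler):
        def handle(self):
            time.sleep(5)                          # pretend to be a slow backend
            self.request.sendall(b"done\n")

    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SlowHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()

    def call_server(i):
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.recv(16)                             # blocks; the GIL is released while waiting

    t0 = time.time()
    clients = [threading.Thread(target=call_server, args=(i,)) for i in range(4)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    print(f"4 calls to a 5-second server finished in {time.time() - t0:.1f}s")
    server.shutdown()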
Exactly, and I’m surprised none of the other comments mentioned this so far. Often web app endpoints are bottlenecked by IO because they’re spending most of their time talking to a database or cache server. Python is probably not the right tool for a CPU intensive endpoint that needs to serve up hundreds of thousands of requests per minute and can’t be cached.
There are a lot of ways to handle the IO intensive scenario in python:
* Threading - Works with python libraries written in C, but now you need to add locks to your code to prevent race conditions. Not good for CPU-heavy work because the interpreter keeps switching context between threads (every ~100 bytecode instructions in older CPython, every few milliseconds in newer versions).
* Gevent - Requires minimal code changes, but it usually blocks when running code from python libraries written in C. This uses event loop style concurrency and automatically patches internal python libraries to switch context when IO occurs. This means less time spent unnecessarily switching context, so it can scale better than python’s threading.
* multiprocessing - Better for CPU intensive work, but requires more memory than other solutions. You don’t need to worry about race conditions since the processes are separate.
* asyncio - Requires code changes and using compatible libraries. It does event-loop style concurrency that allows you to specify exactly when to switch context. Like with gevent, this means less time spent unnecessarily switching context (a minimal sketch follows below).
* ...And there are probably a bunch of other ways I’m missing.
I’ve heard that this sort of stuff is simpler in Go.
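For completeness, a minimal asyncio sketch of the IO-bound case (asyncio.sleep stands in for awaiting a database or cache call):

    import asyncio
    import time

    async def handle_request(i):
        await asyncio.sleep(1)        # "waiting on the database"
        return i

    async def main():
        t0 = time.time()
        results = await asyncio.gather(*(handle_request(i) for i in range(100)))
        print(len(results), "requests in", round(time.time() - t0, 2), "s")   # ~1s, not ~100s

    asyncio.run(main())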
From what I’ve seen in production environments, Python has an uncanny ability to take tasks that should be IO- / server- bound and make them CPU-bound.
For sure, but often that’s a sign that you’re doing it wrong. Maybe those aggregations should be done in the database, maybe it should be cached, maybe you should do that in bulk with one request, maybe it should run in the background anyway, etc.
Sometimes the language is the reason it is IO bound. A good example of this and discussion is in the link below. The heap of abstractions can cause it to be IO bound, or the CPU usage can bound the IO.
Percentages are misleading, what matters is the actual time it takes. Even small gains will have an effect as you increase the traffic and add additional stages to the system. Small delays accumulate and can have a surprising effect on queue sizes and latency.
I think you misunderstand the training of many types of large statistical models. It’s almost always I/O bound (getting batches of data into memory or shipped to GPU) and the CPU bound part is intensely optimized SIMD matrix algebra operations.
The hard part is always that you have more data than can fit into memory, and need complex prefetch solutions to parallelize getting the next batch of data while one batch’s numerical linear algebra computation is in progress.
Many libraries handle this transparently for the user in an extension module, such as Keras & PyTorch.
I think it’s misleading to act incredulous that this is the bottleneck ... it doesn’t have anything to do with Python, GIL, etc. (and there are easy-to-maintain solutions for this in Python).
Years ago I had someone on reddit arguing with me about whether or not the GIL existed and affected Matz Ruby (this was 10+ years ago).
This person was actively arguing with me about what it was. It's one of many battle stories for why I pretty much just assume everyone on reddit is stupid unless shown otherwise.
You can choose a language that optimizes your hardware, or you can choose a language that optimizes your programmers. 99% of the time optimizing the programmers is the right call.
Yes, if by "optimizing programmers" you mean "optimizing the manager's corporate structure footprint and bonus incentives".
If the choice is between hiring one good C++ programmer or 15 really dumb Python "backend engineers" (and a team of QA's and sysadmins to support them), what do you think your pointy-haired corporate boss would choose?
This is a false dichotomy. You don't have to choose between 1 C++ developer and 15+ package of Python developers.
Personal productivity is generally going to be lower in C++ than it is in Python. In most situations you would get more productivity out of a similarly skilled Python dev than out of a C++ dev, so you'd probably need to hire more C++ developers.
Unless your argument is "C++ devs are smart and Python devs are dumb", in which case, let's not start calling people names over the language they use.
> Personal productivity is generally going to be slower in C++ than it is in Python.
No, it entirely depends on the skill and experience level of the programmer.
> In most situations you would get more productivity out of a similarly skilled Python dev than out of a C++ dev
No, a good programmer with equally good knowledge of both languages will code equally fast in both.
> This is a false dichotomy. You don't have to choose between 1 C++ developer and 15+ package of Python developers.
You failed to see the point. Large teams of clueless programmers doing things slowly and badly is a feature of the system, not a bug. KPI's for managers don't include lowering headcount and cost cutting as an incentive. (And trust me, you really wouldn't like it if they did.)
> Unless your argument is "C++ devs are smart and Python devs are dumb", in which case, let's not start calling people names over the language they use.
Yet that's effectively what you just did in your specious 'productivity' argument.
>No, it entirely depends on the skill and experience level of the programmer.
We're comparing programming languages. You control for the other variables - otherwise the comparison is meaningless. My argument is equally skilled programmers will generally be more productive in Python than C++.
>No, a good programmer with equally good knowledge of both languages will code equally fast in both.
Care to explain how? Huge amounts of the features in these higher level languages are explicitly to increase productivity, and it's generally pretty well accepted they're successful. In this very comment section there's multiple sets of people talking about having to implement their own reference counting systems, and all sorts of other things. Implementing those systems takes up productivity.
If any one language was the best at everything, we would only have one language. There's trade offs made, and that's why there frequently is a right language (or set of languages) for one job vs. another.
>You failed to see the point. Large teams of clueless programmers doing things slowly and badly is a feature of the system, not a bug. KPI's for managers don't include lowering headcount and cost cutting as an incentive. (And trust me, you really wouldn't like it if they did.)
This simply hasn't been the case anywhere I've worked at for a meaningful amount of time. Empire building was plenty discouraged at all of the places I have been employed long term and there were KPIs for reducing headcount, or maintaining it while taking on additional responsibility, and accounting was always happy to step in if costs were increasing without solid justification. Reducing headcount doesn't even have to mean firing people or managing them out - it can be not backfilling spots, helping people find teams with open headcount and transferring, etc. It's never bothered me any - if we have too many people for the amount of work, I'm more likely to get bored.
>Yet that's effectively what you just did in your specious 'productivity' argument.
You started this whole tangential discussion by taking shots at Python developers for no apparent reason. I'm not calling anyone names, or anything even remotely similar - different languages are frequently differently suited. There's obviously places where C++ makes a lot of sense.
I don't really have a horse in this race - right this moment, I'd do basically any serious project in rust, and cargo-script has me even doing small scripts in rust as well. Maybe Erlang or Elixir if OTP makes a lot of sense for the project.
You have not had to interview the python devs I've had to interview.
Last round, for the second-best candidate, whom we hired at 130% of the salary we initially offered, after the best candidate was snatched from under our nose for what we were told was twice the salary we were offering:
>"Last question, I saw that you wrote a 100 line function here, is this because you ran out of time?"
>>"No. I don't like to break up my functions. Having many small functions is confusing and bad practice."
Opting for a performant but not scalable solution is basically an acknowledgement that:
- The project will only succeed up to a certain point.
- After the initial launch, no further changes will be made to the project since new additions are likely to cost performance and lower the upper bound on how many users the system can support.
Not many projects are willing to accept either of these premises. Nobody wants to set an upper bound on their success.
Also, it should be noted that no language is 'much faster' than any other language. Benchmarks which compare the basic operations across multiple languages rarely find more than 50% performance difference. The more significant performance differences are usually caused by implementation differences in more advanced operations and in any given language, different libraries can offer different performance characteristics so it's not fair to say that a language is slow because some of its default advanced operations are slow.
Usually, performance problems come down to people not choosing the best abstract data type for the problem. Some kinds of problems would perform better with linked list, others perform better with arrays or maps or binary trees.
Time complexity of any given algorithm is way more significant than the baseline performance of the underlying language.
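A small illustration of that point, where the only difference is the choice of data structure:

    import timeit

    needles = list(range(0, 100_000, 7))
    haystack_list = list(range(100_000))
    haystack_set = set(haystack_list)

    t_list = timeit.timeit(lambda: [n in haystack_list for n in needles], number=10)
    t_set = timeit.timeit(lambda: [n in haystack_set for n in needles], number=10)
    print(f"list: {t_list:.3f}s  set: {t_set:.3f}s")   # O(n) scans vs O(1) lookups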
the title is misleading. sure, it advocates for C++, or Java, or C#, or others above Python, but clarifying the context: "a lot (not all!) of complex “scalable” systems can be done with a simple, single C++ server.", which I take as: "sometimes it's better to write a simple server in a low level language, than a complex server in a higher level language".
people will bike to work to "save the environment" but won't use C direct on metal to reduce the carbon footprint of their code.
For a guy like Carmack, it may be quite frustrating working within the constraints of PyTorch etc. He'll probably end up making his own PyTorch frontend in C, which as a bonus people will use to deploy models.
I don’t think it’s that big of a language issue. Java isn’t that much slower (logarithmically speaking). But design choices have arbitrarily incorporated a huge amount of bloat and inefficiency.
yep, logarithmic. Well, the question is, what's the motive behind those design choices? Today's e.g. web app choices seem to be basically aesthetic, with an eye for extreme novelty and total disregard for how many times a buffer is being copied back and forth before reaching the user.
That is compared to Python/Ruby, which would be multiple logs slower, or to scaling choices that make the service resemble microkernels. Usually dedicated, simple engineering can be very fast using either C or Java.
Hypocrisy is not such a rare animal. There is however extra factor here: biking to work if possible is your personal choice. Using your favorite tools at work: often not as much
probably not so much hypocrisy as lack of consideration. Even though the energy benefit from making apps e.g. twice as fast isn't so big, it's worth giving a try for the time savings.
Keep in mind a modern commodity x86 server with 128 physical cores, 4 TB of RAM, a decent amount of SSD storage, and dual 100 GbE NICs is about $70K.
Ability to use something like Rust also changes equation significantly.
I believe the key is being able to use something _native_, not interpreted code or bytecode running on a virtual machine, which runs in a userspace process inside a virtualized server inside a bare-metal server.
I think it’s a valid point that the reasons why C++ is good for perf - compiled, statically typed, low-level, manual memory management, high degree of control over memory layout - would also apply to Rust. If they don’t then that would really call into question positive claims being made about Rust.
I was curious mainly because I remember some long comments he made about the relative value of linting and static analysis tools, specifically going into Coverity's analysis of the Doom 3 code (I think), fixing everything it flagged for some subset of the code, and then asking himself whether it was really better or more of a hindrance that obfuscated code.
IIRC, his conclusion was mixed: a lot of it was obviously beneficial and worth having turned on, but much of it wasn't, and going forward he intended to make it a limited but continual part of his toolchain.
So my interest in whether he'd tried Rust was whether he'd compared Rust's changes like the borrow checker against his earlier conclusions on writing good C/C++. Cute that he's tried and liked it, but I'd really like to see a more in-depth comparison from him.
Right. He’s got some kind of notorious programming style with who knows what kinds of object graphs. It would be quite a data point to know if he thinks that he can comfortably map it to Rust.
Morals of this story:
1. Always use the most performant language available to you. (What if your program gets a few million users?)
2. Horizontal scalability is too much complexity/work. Just apply the correct amount of optimization when you initially write the code.
If you, like me, read this on mobile and did not click [more answers] link under that, then do it. That may save you a minute or two of derealization time.
I really had a hard time deciding whether I agree with that statement or not. If you design a whole service, it is obviously not true. But if you develop something like a specialized backend, say a database, you might want to reconsider the complexity of horizontal scalability.
#2 is almost no longer true, because there are simple templates you can follow that will scale you to infinity in the cloud (which also means your cost is going to scale too, but hey, we saved engineering time on scaling).
This has been a repeated point since "enterprise" Java (a.k.a. Jabba) became a thing in the late 90s and early 2000s. A ton of enterprise code is comically inefficient and held together with scotch tape and used chewing gum.
It ends up boiling down to the fact that compute power is a lot cheaper than developer time and really good developers are more expensive and harder to find than inexperienced ones.
From this thread, I feel a lot of people misunderstand what a scaleable system means for a scale-up startup or BigTechCo. It doesn’t mean cost-efficiency. It means the ability to solve scalability issues at 10x, 100x, 1000x load by throwing money at it (aka buying more instances/machines).
Yes, John Carmack is right that a bunch of scaleable systems that see reasonable load, with well-understood traffic requirements, could be rewritten in more efficient ways. But how long would it take? How much would the extra work cost? And what would happen if traffic goes up another 10x, 100x? Can you still throw money at that problem, with this new and efficient system already in place?
One of my memorable stories comes from an engineer I worked with when we rewrote our payments system[1]. He told me how at a previous payments company, they had a system that was written in this kind of efficient way, and needed to run on one machine. As the scale went up, the company kept buying bigger hardware, at one point buying the largest commercially available mainframe (we’re talking the cost being in the multi-millions for the hardware). But the growth was faster and they couldn’t keep up. Downtime after downtime followed at peak times, making hundreds of thousands of losses per downtime.
They split into two teams. Team #1 kept making performance tweaks on the existing hardware to try to get performance wins and increase reliability during peak load. Team #2 rewrote their system in a horizontally scaleable way. Team #1 struggled and the outages kept getting worse. Team #2 delivered a new system quickly and the company transitioned their system over.
What they gained was not cost savings: it was the ability to (finally!) throw money at their growth problems. Now, when traffic went up, they could commission new machines and scale with traffic. And they could eventually throw away their mainframe.
Years later, they started to optimise the performance of the system, making millions of $ in savings. The new system - like the old one - cost millions to run. But that was beside the point. Finally, they stopped bleeding tens of millions in lost revenue per year due to their inability to handle sudden, high load.
It’s all about trade-offs. Is cost your #1 priority, and are you okay spending a lot of development time on optimising your system? Go with C++, Erlang or some other, similarly efficient language. Is product-market fit more important, along with uptime you can simply buy more of? Use the classic, horizontally scaleable distributed systems stack, and worry about optimising later, when you have stable traffic and optimisation is more profitable than further product development.
While that’s true, this is the same issue I have with Hadoop clusters which can easily be replaced by a single server and some grepping (well almost;))
Most companies or startups or even bigtech are not big enough for these problems.
All these complexities are horrible, but good for the AMZN stock price, as devs keep spooning up a ton of VMs
What about in the context of trying to get to an MVP? Is the dev-time speedup of using a dynamic programming language and stack significant over using a C++ backend? You wouldn't care much about performance when you're trying to figure out if you'll get traction.
No, it's not significant. Dev time depends on programmer skill, not the toolset. A good C++ programmer will develop your MVP many times faster than an average Python programmer.
Python programmers are much easier to hire, though - you already need a good C++ programmer on the team to hire another one, because HR and corporate management can't run a proper hiring process for that on their own.
This last factor is the overarching most important one for BigCorp Enterprise Inc., not development speed or cost.
> Dev time depends on programmer skill, not the toolset.
This is obviously not strictly true, always. A skilled programmer will use the proper tools for the job.
If you, for example, are tasked with writing a backend service exposing a GraphQL API, I think it would be foolish to do this in C++, and I would bet that the average Python programmer would do it quicker than even a top-tier C++ programmer (if the latter were hellbent on doing it in C++).
Especially when working with MVPs (or new projects in general), the ability to leverage already existing tools and frameworks is key to rapid progress. This doesn't necessarily have to be scripting languages, but the Python/Node/Go/etc. developer would have a working GraphQL server up and running, connected to a database of choice, within an afternoon, while the skilled C++ developer would have to spend at least a few days implementing a GraphQL server mostly from scratch [0].
[0]: A quick Google show that schema parsers exists for C++, but nothing matching the frameworks/library available for more web-fashionable languages.
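For what it's worth, a minimal sketch of the Python side using the graphene library (graphene 3.x style; the schema and resolver names are illustrative, and serving it over HTTP would still need a web-framework integration):

    import graphene

    class Query(graphene.ObjectType):
        hello = graphene.String(name=graphene.String(default_value="world"))

        def resolve_hello(root, info, name):
            return f"Hello {name}"

    schema = graphene.Schema(query=Query)
    result = schema.execute('{ hello(name: "GraphQL") }')
    print(result.data)   # {'hello': 'Hello GraphQL'}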
a) Writing a schema parser is not rocket science. In fact, for a good programmer implementing their own GraphQL library would be quicker than integrating some third-party library. So your first point ("average Python programmer would do it quicker than even a top-tier C++ programmer") is absolutely wrong.
b) There's no value in an MVP that does something generic that is already available in off-the-shelf libraries. Your GraphQL example is pretty pointless because it doesn't actually do anything.
> ...the Python/Node/Go/etc developer would have a working GraphQL server up and running connected to a database of choice within an afternoon
Well, no. By the end of the week they'll still be arguing about which package manager to use, whether TDD is a good idea, what makes a microservice 'micro' and how to configure Kubernetes.
Not really; hiring Python and Java devs is very amenable to keyword-driven recruitment. Hiring C++ devs in the same manner is a clusterfuck waiting to happen.
I have watched the OCaml community go through a renaissance when for a while it looked like it was moribund, and the D community looks like it is developing the same undercurrent of momentum. Given that D is a great language and the implementation looks solid, I don't think it is in danger of dying.
The analogy here is one of the best ice climbers in the world proposing an ascent of the Matterhorn. Please use testable, easy to prototype with, memory managed languages for production servers unless you are solving a very specific problem and really know what you are doing.
Very apt. I think people are also forgetting “the bad old days” where you worked on a large-ish cpp or java app for a year, nothing worked right, schedules slipped, and then the whole thing was scrapped and teams disbanded to work on other stuff. That was very common. You can’t count on having a team of Carmacks work on your blub app.
> ...My bias is that a lot (not all!) of complex “scalable” systems can be done with a simple, single C++ server.
The second tweet of the discussion:
> JAVA or C# would also be close, and there are good reasons to prefer those over C++ for servers. Many other languages would also be up there, the contrast is with Really Slow (but often productive and fun) languages.
I'm afraid that Carmack has sided with his own anecdotal experiences of the 1990s to justify the use of C++ for server-side development in the 21st century. This probably made sense at the time due to the availability of more C++ devs and fewer language choices, but today in the 2020s? I remain totally unconvinced by his argument.
He goes on to suggest Java or C#, which still make sense for many companies for generic server-side development if you are after a more secure backend; Kotlin is pretty much the most sensible choice for this. Given Carmack's engineering background, however, it is unsurprising that Java/C#/Kotlin are technically unsuitable for high-performance gaming platforms, if one were to create one. So what credible languages could be used to compete with C++? I hear Discord is having a great time using Elixir (Erlang could also be used), and another gaming platform called 'Hadean' is using Rust for their platform.
"But for the sake of Carmack's engineering background however, it is unsurprising why Java/C#/Kotlin are technically unsuitable for high-performance gaming platforms if one was to create one."
This really is just not true.
Financial, real-time style High Frequency Trading apps are often written in Java - not C++.
Much of the JVM is not a VM, it compiles to machine code - in an optimised manner. For starters.
Given how difficult it is to develop safely in C++ I can hardly think of a reason to ever use it on the backend.
Trading apps generally process a small amount of data. Graphs are downright lightweight compared to what a 3D game pumps through.
Generally, for a 3D game manual memory management and explicit data layout are critical. For example, it's common to use custom memory allocators with a region for each frame, a region for each loaded level, etc... This is then much cheaper to simply drop on the floor than any kind of object-by-object cleanup, whether that is reference counting, garbage collecting, a traditional heap, or whatever. Even Rust can't yet compete with this!
Similarly, many game engines use ECS systems or in-memory columnar data layouts (structure of arrays instead of arrays of structures) to enable SIMD instruction sets such as AVX.
Java can be coerced into doing much of the above, but it generally takes a ridiculous effort to approach what comes nearly effortlessly with a language like C++ or Rust.
Even C# is a better choice than Java, as it monomorphises more code and recently had a range of extensions[1] added to reduce GC pressure such as stackalloc, Span, Memory, MemoryPool, SequenceReader, ValueTask, etc...
I've bought and played several games written in C#, but other than Minecraft I'm not aware of any popular real-time games written in Java in the last 15 years or so. Meanwhile, Minecraft is not at all smooth on my very high end gaming PC in 2019 despite being 8 years old. (I'm sure this can be eliminated by tweaking some settings, but it's indicative of the problem.)
I say this as someone who used to professionally develop browser-based Java games back in the early 2000s and had to personally jump through hoops to reduce heap allocations to avoid GC pauses.
Yes, thanks for that, but I think the author was referring to the game sync server, not actual games. Syncing minimal state data can use small data structures etc. But thanks though, great comment.
If anything the server-side coding is harder. Many games do the full physics/simulation on the server to minimise cheating, and have to simulate from every player's perspective. Meanwhile the clients have a single perspective and most of the computation effort is offloaded to the GPU.
Additionally, most multiplayer games have the same codebase for the server and the client for the obvious reasons. Single player is literally "online play" with an in-memory channel to a local server.
All Quake-based games work this way, including the derivative Source-based games and a bunch of other engine variants. Unreal works this way too if I remember correctly.
You can't realistically write a client in C++ and a server in Java. You'd be practically doubling your development time!
> Financial, real-time style High Frequency Trading apps are often written in Java - not C++.
They're written in a very specific style of Java, where you take pains not to allocate, which looks a lot more like C++ than idiomatic Java. Operationally, strange rituals are followed to avoid recompilation and other otherwise-normal runtime activity during trading hours. The app will be started on Sunday night, and the team spends the week praying it doesn't need restarting during the week.
This is way less hardware than most people in the trade (from web devs to devops) seem to think when asked about it.
SO ranks #36 in Alexa right now: https://www.alexa.com/siteinfo/stackoverflow.com