What's wrong with 2006 programming? (2010)

antirez · on Dec 21, 2016

Hello, a few things that changed in the latest 6-7 years:

1. The Redis project abandoned attempts to have a mixed memory-disk approach, at least for the near future. I want to focus on trying to do at least one thing well and it is already hard ;-) You know, the no-need-to-konquer-the-world approach. Otherwise the project per se is interesting. Redis Labs has a commercial fork that works that way for instance (which I believe was initially based on the Redis "diskstore" branch I was working on in order to replace the former "virtual memory" Redis feature), but not the OSS side. Maybe I'll change my mind in the future but so far I can't see signs of my mindchange ;-)

2. About threads, we are now a bit more threaded: Redis 4.0 is able to perform deletion of keys in background, Redis Modules have explicit support for blocking operations that use threads, and so forth. However my goal in the next 1/2 years is to finally have threading in the I/O, in order to scale syscals, protocol parsing, to multiple threads but not data access. So regarding the 2006 programming, things will be the same.

Basically I still believe that to do application-side paging now that disks are also faster (ratio compared to RAM) is an interesting approach. I still think that using the kernel VM to do so is a bad idea in general, but could work for certain apps.

eternalban · on Dec 21, 2016

> Basically I still believe that to do application-side paging now that disks are also faster (ratio compared to RAM) is an interesting approach. I still think that using the kernel VM to do so is a bad idea in general, but could work for certain apps.

Please elaborate. If disk/block-device performance is improving, wouldn't the VM benefit as well?

Also the last sentence seems to make more sense the other way around: VM in the general case, user-land memory management for "certain apps".

antirez · on Dec 21, 2016

The OS VM would benefit, the problem is with using the OS VM in order to implement paging in certain applications like Redis. Does not work well because there is a tension between flexibility of in-memory representation and data locality, and OS VM needs data locality because it has no info about content and requires logically grouped data to be in near pages.

About VM in the general case: yes if for general case you mean, a random process is running and is out of memory. If we are talking about in-memory systems wanting to off-load data to disk IMHO the default is that VM does not work well.

camtarn · on Dec 21, 2016

Since the article doesn't mention it til quite far in: Redis is apparently single threaded, which is why the blocking nature of OS page swapping is so disastrous. Presumably for a more traditional server with lots of worker threads this would be less true.

tyingq · on Dec 21, 2016

It does make the conversation interesting, as the "varnish guy" sort of subtly suggests that single threaded is subpar. Which seems odd, given that nginx is single threaded, and in a somewhat similar space to varnish...and seems to enjoy a good reputation for performance.

mdasen · on Dec 21, 2016

http://www.cs.princeton.edu/~vivek/pubs/pai_flash_99.pdf

That's a reasonably good paper on the trade-offs between event-driven, multi-threaded, and hybrid approaches to file serving.

I don't know that much about nginx in particular, but it seems like they've implemented thread pools for blocking operations: https://www.nginx.com/blog/thread-pools-boost-performance-9x.... "Hard drives are slow (especially the spinning ones), and while the other requests waiting in the queue might not need access to the drive, they are forced to wait anyway." So, if you're blocking reading a file from the hard drive, all the other requests are queued up behind it.

The thread-pool approach noted in the nginx blog sounds pretty much the same as the approach in the linked paper.

nginx does have a good reputation for performance, but I think a lot of that reputation comes as a front-end for web applications rather than serving lots of hard-to-cache files.

Anyway, the nginx blog article as well as the academic paper note that single-threaded event-driven has drawbacks around file io and using a worker pool of threads or processes to offload blocking operations onto can help mitigate that.

tyingq · on Dec 21, 2016

The thread pools are optional, and reading the link you posted, not recommended unless specific conditions exist. They use streaming media as a good use case for the thread pools.

Nginx is commonly used as a caching proxy, and called out as being high performance in those cases. I can't speak as to whether what's being cached is "hard-to-cache" files.

citrin_ru · on Dec 21, 2016

> nginx does have a good reputation for performance, but I think a lot of that reputation comes as a front-end for web applications rather than serving lots of hard-to-cache files.

Netflix uses nginx to serve hard-to-cache files (using aio sendfile under FreeBSD).

kelnos · on Dec 21, 2016

nginx is a forking server, though, so individual workers being blocked wouldn't affect others, and the application as a whole can use all available CPU cores.

tyingq · on Dec 21, 2016

That's roughly the reply from antirez regarding redis being single threaded, just run more processes.

Let's start with Redis being single threaded, the path we are taking is to build "Redis Cluster"...This means that Redis will run 48 instances in a 48 core CPU...

kelnos · on Dec 28, 2016

Right, but running redis in that manner is far more operationally-intensive than running several nginx worker threads.

nginx runs workers by default, which (I believe) can be tuned by a couple config options.

To run multiple redis instances as a part of the same cluster, you need a way to shard your data (which you have to reason about client-side), you need separate config files, data directories, etc. for each instance. It's a huge pain.

smegel · on Dec 21, 2016

It is subpar on Linux. On an engineered OS like FreeBSD it is amazing. There is a reason Netflix/Cloudflare/more run all their nginxs on BSD.

https://www.nginx.com/blog/thread-pools-boost-performance-9x...

baotiao · on Dec 21, 2016

I think the different between redis and webserver like nginx is that all the operations in redis is almost the same, it is about less than 1ms. However the request to nginx fall in a widely range, some request need 10ms, while some request need 10s. Since nginx need do some file operations.

So the single model work well for redis, but it doesn't work well for nginx, since if there is a request in nginx that is blocking for about 10s, people can't tolerate this situation.

dvirsky · on Dec 21, 2016

It is important to note that in the many years since this post, while redis has remained single-threaded - it also removed the entire concept of VM, and now works only fully in memory.

baotiao · on Dec 21, 2016

However, redis transfer these works to jemalloc. Now jemalloc control the entire VM

dvirsky · on Dec 21, 2016

In the past you could tune redis to hold a dataset larger than the memory you had, and it would swap pages on its own. About a year after this 2010 post, antirez decided to remove this completely (in redis 2.6 or 2.8, I don't remember) and focus entirely on fully in-memory situations. VM in the redis sense used to be redis itself swapping stuff to disk with multiple threads.

Here are the redis configuration notes on VM from redis 2.2:

# Virtual Memory allows Redis to work with datasets bigger than the actual

# amount of RAM needed to hold the whole dataset in memory. # In order to do so very used keys are taken in memory while the other keys

# are swapped into a swap file, similarly to what operating systems do

# with memory pages.

....

# vm-max-memory configures the VM to use at max the specified amount of

# RAM. Everything that deos not fit will be swapped on disk if possible, that

# is, if there is still enough contiguous space in the swap file.

...

# Redis swap files is split into pages. An object can be saved using multiple

# contiguous pages, but pages can't be shared between different objects.

# So if your page is too big, small objects swapped out on disk will waste

# a lot of space. If you page is too small, there is less space in the swap

# file (assuming you configured the same number of total swap file pages).

# If you use a lot of small objects, use a page size of 64 or 32 bytes.

....

# Max number of VM I/O threads running at the same time.

# This threads are used to read/write data from/to swap file, since they

# also encode and decode objects from disk to memory or the reverse, a bigger

# number of threads can help with big objects even if they can't help with

# I/O itself as the physical device may not be able to couple with many reads/writes operations at the same time.

# The special value of 0 turn off threaded I/O and enables the blocking Virtual Memory implementation.

vm-max-threads 4

baotiao · on Dec 21, 2016

get it, thank you

geerlingguy · on Dec 21, 2016

The comments on this post are enlightening. I use both Varnish and Redis, and the architecture discussion is great!

koverstreet · on Dec 21, 2016

One thing that would really help is if we had buffered asynchronous IO.

trevyn · on Dec 21, 2016

(2010)

ploxiln · on Dec 21, 2016

PHK's post, which inspired this, assumes that the process is swapping. It describes writing an page to disk to free up that page, then reading in the anonymous page of data that needs to be used for the write() system call the process uses to manually cache the data to disk. For the stuff that I use and work on, if the system is swapping anonymous pages, the situation is dire and it's time to kill (processes).

Let me back up and try to explain a bit:

While OS kernel developers have put a huge amount of effort into virtual memory management and paging, which was and is a good and necessary thing, the definition of "interactive" and "low latency" has changed. Long ago, half-second latency at a virtual terminal connected to a mainframe with hundreds or thousands of users was fantastic, compared with dropping off your stack of punch-cards and coming back 12 hours later.

For most of the software I use and work on today, I want low sub-second latency. It's often only achievable with reasonable direct control of what is in memory and what is on disk. If I click a menu in a GUI program that I haven't clicked in weeks, I don't want to wait half a second for a few scattered pages to be paged in/out of swap. Same goes for requests to web or api servers - I don't want less-common requests to take a half second longer than the typical 50ms or so. For desktop environments, GUIs, databases, caches, services: no swap.

Certainly, data, multimedia files, dictionaries, etc will need to be read from disk. The processes can arrange for separate threads to do that. We can have responsive progress bars, cancel buttons, priorities, timeouts before hitting an alternative data source - but only if the process itself is in RAM, not in swap.

Now that desktop and server systems measure DRAM in 10s of gigabytes, this really should not be hard to achieve!

I've struggled with swap and out-of-memory situations on Linux many times. The linux kernel never seems to OOM-kill processes fast enough for me. If I have no swap, then if memory pressure sets in, the kernel struggles to shrink buffers, practically freezing most processes, for a few minutes before finally killing the obvious culprit. (I've also tried memory-limiting containers, and they suffer the same problem - freeze up for a few minutes instead of immediately killing when OOM.) I used to enable plenty of swap, more than RAM, because that was the common wisdom, but it causes the same problem when the system comes under memory pressure, everything freezes for a few minutes. But it also has the additional problem that despite setting swappiness to 1 or 0, some strange services/applications will cause the kernel to put some anonymous pages in swap, even when there's plenty of free physical memory. I never want that! I need to periodically swapoff and swapon to correct it.

So, at each company I work for, I end up writing a bash script, run by cron each minute, which checks for low system memory, looks among the application services for an obvious culprit, and sends it SIGTERM. In practice, this solves the problem pretty much every time, in the most graceful way. It's extremely rare that a critical system process is the problem or looks like the problem. (Except dockerd a couple times ;)

(This is not to bash Linux in particular, Windows and MacOS use way more RAM and swap in general. I've heard the BSDs have been good at particular things at particular times, but driver support has always been more of a struggle. Besides the swap / OOM behavior, I'm pretty happy with Linux.)

Letting the OS manage disk and RAM makes perfect sense for bulk data processing - hadoop, spark, or other map-reduce or stream-processing where a few seconds pause here and there is no problem if throughput is maximized. But I personally don't work much on those things - and I'm not a rare case.

smegel · on Dec 21, 2016

> OS paging is blocking as hell

No, Linux is rubbish. Seriously. FreeBSD does this properly.

Edit: FreeBSD, Windows, OSX, Solaris, AIX, HP-UX(?)...

doublerebel · on Dec 21, 2016

The way SmartOS performs under memory pressure is completely different from Linux, the OS is still usable where Linux would be completely frozen. Admittedly I don't know the underlying implementation behind this feature.

trungaczne · on Dec 21, 2016

Do you have any articles that talk about how FreeBSD does memory management differently?

smegel · on Dec 21, 2016

https://people.freebsd.org/~jlemon/papers/kqueue.pdf

wmf · on Dec 21, 2016

Admittedly I didn't read it in detail, but I don't see anything about page-ins there. Can you explain the connection between kqueue and paging?

tedunangst · on Dec 21, 2016

There is no connection.

FreeBSD does page faults just like everybody else. Your process blocks until the memory is read in.

smegel · on Dec 21, 2016

Sure it would be my pleasure. Kqueue allows a read request to be scheduled that is non-blocking on a page fault. Linux always blocks the thread executing read() on a page fault. This is still true using aio_read(), as all that does is run another thread to call read() which blocks. Which is great for small numbers of read requests but scales poorly.

And the bit from the paper that is relevant:

> A non kqueue-aware application using the asynchronous I/O (aio) facility starts an I/O request by issuing aio read() or aio write() The request then proceeds independently of the application, which must call aio error() repeatedly to check whether the request has completed, and then eventually call aio return() to collect the completion status of the request. The AIO filter replaces this polling model by allowing the user to register the aio request with a specified kqueue at the time the I/O request is issued, and an event is returned under the same conditions when aio error() would successfully return. This allows the application to issue an aio read() call, proceed with the main event loop, and then call aio return() when the kevent corresponding to the aio is returned from the kqueue, saving several system calls in the process.

SamReidHughes · on Dec 21, 2016

Linux's family of async I/O functions is io_submit and friends, not aio_read. That paper is too old to be relevant.

Also, using a worker pool for I/O scales just fine and has no real disadvantages when talking to a disk or SSD.

wmf · on Dec 21, 2016

OK, so you're talking about AIO and other people here are talking about mmap. If you have working AIO then you can indeed write a fully async server at the cost of extra memory copies.

smegel · on Dec 21, 2016

Sadly mmap is also blocking on page faults on Linux :(

mfukar · on Dec 21, 2016

I would be very interested in how you envision a system that does NOT block on page faults.

lgunsch · on Dec 21, 2016

Read and write system calls have nothing to do with jump instructions causing a page fault by traversing data structures. That is what Redis is all about - in memory data structures available to clients.

smarnach · on Dec 21, 2016

What else can you do other than blocking until the page has been loaded? How would it be possible to resume a single-threaded process while the memory it's trying to access is not available?

smegel · on Dec 21, 2016

You say "hey process, this data is not available yet, but if you listen on this event port I will sent you a notification when it is ready, then you can do whatever you want with it. In the meantime, continue to serve up data that is in memory".

Pretty simple stuff really.

smarnach · on Dec 21, 2016

You might be able to do that in response to a system call asking whether a memory page is available, but not in response to a page fault. A page fault occurs if the process tries to access a virtual memory page that's not currently mapped to physical memory. Unless I'm missing something, the only reasonable way of resuming the process at the machine code instruction that caused the page fault is to first make sure the memory is available. Otherwise you will simply get another page fault.

dllthomas · on Dec 21, 2016

You don't necessarily need to resume the process at the machine code instruction that caused the page fault.

smarnach · on Dec 21, 2016

Right, you also have the option of not resuming it at all.

dllthomas · on Dec 21, 2016

I mean in principle, you could resume in another place. That's what's being done if you register a handler for SIGSEGV, for instance. Not that there's much you can do there, with existing programming models.

smegel · on Dec 21, 2016

1. thread: hey kernel, I would like to read this page of memory.

2. kernel: hey thread, that page is still on SLOOOOW spinning disk, why don't you go off and do something else while I get it for you. I'll let you know with an event notification, so be sure to check in with kqueue regularly.

3. thread: OK then, i'll go off and serve these other people while you do that for me. kthanx.

4. kernel: hey DMA controller, I'd like you to get sectors 4, 5 and 6 from platter 3 on HDD 2 and load them into memory address 0x4fe6bb. Please send me an interrupt when done.

5. DMA controller: OK servo, please adjust read head to this offset. Read head, read me those bytes. Memory controller, please store bytes at address 0x4fe6bb. Hey CPU, here's an interrupt to store in your interrupt table, please wake up the kernel guy and let him know.

6. kernel: wow I just got interrupted. The interrupt seems to map to a request for data from this particular thread. Better let him know. (sends event up thread's kqueue).

7. thread: hey, I just got a kqueue notification that the file is now ready to be read. That means it must be in memory...cool!

Make sense now?

smarnach · on Dec 21, 2016

I get that you can do this in response to a system call (the one issued in step 1). You could avoid blocking in Linux as well using madvice() and mincore(), as pointed out in the comments to the article. However, once you get an actual page fault, the only option is to block the process.

smegel · on Dec 21, 2016

> I get that you can do this in response to a system call (the one issued in step 1). You could avoid blocking in Linux as well using madvice() and mincore()

Yes, but the one thing they don't give you is notification the data is in memory. Do you really want to spin a hot loop calling mincore?

> However, once you get an actual page fault, the only option is to block the process.

See steps 1-3 again.

lgunsch · on Dec 21, 2016

The only way to receive an event would be from a signal, like SIGSEGV. There is no way to hook a jump instruction up to kqueue. In any case, this would mean writing specific code to handle the case, implying it is not simply a difference of the FreeBSD kernel vs the Linux kernel.

twic · on Dec 21, 2016

This is roughly what 'scheduler activations' and 'kernel scheduled entities' did. Except rather than just being in response to page faults, it was any interaction with the kernel that would block.

Dylan16807 · on Dec 21, 2016

Please tell me how it's possible to do something like socket.write(data), when data has been swapped out, without blocking.

smegel · on Dec 21, 2016

1. make a non-blocking read on the data

2. while it is being paged into memory, <DO OTHER STUFF>

3. now your data is in memory and you can update it (writes are async by default even on Linux, as the write just goes into memory and will get synced out by the kernels page sync mechanism but you can override that by setting the O_SYNC flag).

Dylan16807 · on Dec 21, 2016

Triggering a non-blocking read and continuing execution requires either writing different code or an extremely elaborate page fault feature that can dynamically rewrite your execution flow behind your back.

Writing different code is not the question I asked. That's not an OS feature. Do the OSes you like have the latter ability, or are you blowing smoke?

to3m · on Dec 21, 2016

Windows can do this - its "non-blocking" read calls are better described as "asynchronous", and take ownership of the buffer until the job is cancelled or it runs to completion. So if the buffer is paged out, that's fine; the program can carry on and the buffer can get paged in when the system needs it (perhaps on an otherwise idle CPU - so it needn't necessarily take any time from the perspective of the calling thread).

The POSIX semantics on the other hand are simply that the read won't block due to lack of input, so it does as much as it can straight away and then returns. If there's data, but the buffer is paged out, the non-blocking read call has to take more time, because the problem is a paged-out buffer and not lack of input.

(The Windows equivalents of man page section 1 are so awful that most POSIX fans just run a mile and install cygwin. More fool them; once you get to man page section 2, it's a lot better. MAXIMUM_WAIT_OBJECTS is lame, though.)

smegel · on Dec 21, 2016

> Triggering a non-blocking read and continuing execution requires either writing different code

Well you are asking about doing something non-blocking. That means you must be in some kind of event-loop scenario, otherwise you would be happy with synchronous operations. And yes, you would need to add a read event on the file, and then trigger the socket.send() once the data is in memory.

I never said it was free. But at least it is viable.

Dylan16807 · on Dec 21, 2016

Your response to paging being blocking was to blame Linux. Every OS has paging be blocking. Linux has ways to do non-blocking IO. There is no reason to blame Linux.

lgunsch · on Dec 21, 2016

This assumes you are doing read and write system calls, not following pointers in memory.

In the Redis article, it is talking about in-memory data structures. The page fault happens from following a pointer to another location in memory, for example, a linked list, or a skip list. There is no read call to replace with an async read call. Your code evaluates the pointer to a page that has been swapped out, a page fault occurs, and the OS has to swap it back in for you to be able to read that memory.

You could potentially create a new signal, for page faults (maybe there is one I've never heard of), but that would still not let you continue executing from the previous location.

mfukar · on Dec 21, 2016

Your step 2 is magical, in the sense that OTHER STUFF wants THEIR data in memory as well.