Hacker News
The Full Stack, Part I (facebook.com)
243 points by dimm on Dec 2, 2010 | 23 comments



> Well, if the table is in InnoDB format it will result in one disk-seek, because the data is stored next to the primary key and can be deleted in one operation. If the table is MyISAM it will result in at least two seeks, because indexes and data are stored in different files. A hard drive can only do one seek at a time, so this detail can make the difference between 1X or 2X transactions per second.

Nonsense.

Innodb has clustered primary keys, which means that the row data is attached to the leaf nodes of the primary key index, as the author correctly states. However, the leaf nodes and the non-leaf index nodes are actually stored in different segments of the table space! While they live in the same (giant) file, it is unlikely that they would ever be contiguous enough on disk to be read in a single random IO operation.

But it's more complicated than that: if any of the index pages or data pages have been read recently they will probably still be in the buffer pool, which means that they will require no disk operations.

But that's just the seek operation to find the row. The write operation is a different story yet.

What innodb will do is modify the row by marking it with the transaction id in which it was deleted. It will keep the row in place so readers with an older transaction id will still see it until all those transactions are complete. The change in the row and the row page will be written to the copies of the affected pages in memory only. Eventually the data pages and any affected index pages will be flushed to disk, potentially grouped with other changes to the same pages. IO operations occur on the level of reading and overwriting whole pages only, if not more.
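To make the delete-marking scheme concrete, here is a toy Python model of it. This is not InnoDB's actual code or data layout, just a sketch of the visibility rule described above: a delete stamps the row with the deleting transaction's id, and a reader still sees the row if its own transaction id is older than that stamp.

```python
# Toy model of InnoDB-style delete marking (illustration only).
class Row:
    def __init__(self, data):
        self.data = data
        self.deleted_by = None  # txn id that deleted this row, if any

    def delete(self, txn_id):
        # Mark in place; the row physically stays until old readers finish.
        self.deleted_by = txn_id

    def visible_to(self, reader_txn_id):
        # A reader with an older txn id still sees the "deleted" row.
        return self.deleted_by is None or reader_txn_id < self.deleted_by

row = Row("some record")
row.delete(txn_id=100)
assert row.visible_to(99)       # older reader: row still visible
assert not row.visible_to(101)  # newer reader: row is gone
```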

Concurrently it will record in the log buffer every change it makes to the pages in memory. This won't get written to disk right away either, it will flush the log buffer to disk once per second in the default configuration.

So there are many more potential disk operations required of innodb than myisam. Generally innodb is preferable because it is vastly more reliable, and because it can handle concurrent reads and writes to the same data -- MyISAM basically can't. MyISAM will in fact generally be FASTER for any single operation than InnoDB, because it simply does less.


I think you're right. I'll try to come up with a simpler example and update the piece. Thanks!


Corrected, though I'm not completely happy with it:

http://carlos.bueno.org/2010/11/full-stack.html


I started as a performance software person, from memory and cache conscious algorithm design in grad school, to network stack testing, to web server/db server/client-side performance testing and optimization. It is an excellent way to develop that feel for the big picture of how things are working. In enterprise software, there used to be a lot of low hanging fruit and it was fun to get "heroic" results with simple SQL tuning.

The flip side of that experience and mindset is that now, as I try to shift towards more functional and declarative programming styles, I sometimes get sucked into the premature optimization black hole -- I overthink rather than just doing some simple exploratory programming.

I keep telling myself I'm going to implement some fairly large project using nothing but lists and a Lisp to break it down.


I get around this by implementing the abstract version of my code first; I then implement the abstractions that I used in the abstract version of my code. So I'm not even worried about running code until I have a general idea of what I want it to be doing.

I obviously then rework stuff as I learn that my assumptions were totally wrong, but this way I've only started from an incorrect conceptual model, rather than an incorrect full implementation. I think this makes it easier to manipulate.

edit: And I guess the point of that is that according to XKCD, I start at sociology, and work my way down the stack until I get to whatever language/libraries (hopefully not Math) I'm actually programming in. Like some sort of crazy fractal.


It's definitely a fine line between taking a "bottom-up performance" mindset and a "premature optimization" one. Way too often, I'll start approaching something with a fast-complex solution, spend some hours on it, run into major bugs, and then quickly turn around, do the simple thing instead and usually get it done in under half an hour.

I have the feeling that I can only really overcome this by knowing the domain well enough that I don't feel the urge to take on such big problems anymore. It seems like there's a vast middle ground of knowledge where you know enough to not be naive, but you don't know enough to understand what you're getting into with the harder solution.


I don't think I agree with the conclusion that performance will "depend almost completely on the random seek time." You can store the entire 10TB library on five 2TB spinning drives. Five drives can easily serve up 500Mbps (that's only ~60MB/s, single-drive territory). So, on to seeking.

2000 streams, 5 drives, that's 400 per disk. Let's say we have the world's worst disks, that can only do 10 seeks per second. 400 / 10 means we have to buffer 40 seconds of data per stream (per seek), and we have to read it in 0.1 seconds before moving to the next stream. 300kbps * 40s / 8 = 1500KB of data. 1.5MB / 60MB/s disk transfer takes 0.025 seconds, well under 0.1.

I guess that's alluded to by "non standard prefetching", but I don't think it's that advanced. Especially since in a streaming video application, the client software is already going to be doing buffering for you. The bottleneck is bandwidth.

Check my math please? :)
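Re-running the comment's arithmetic, using exactly the numbers given above (2000 streams at 300kbps, 5 drives, a pessimistic 10 seeks/s, 60MB/s transfer):

```python
# Back-of-the-envelope check of the seek math above.
streams = 2000
drives = 5
seeks_per_sec = 10            # deliberately pessimistic disk
stream_kbps = 300             # kilobits per second per stream
xfer_mb_per_sec = 60          # sequential transfer rate per drive

streams_per_drive = streams / drives                    # 400 streams/disk
secs_between_seeks = streams_per_drive / seeks_per_sec  # 40 s of buffer needed
budget = 1 / seeks_per_sec                              # 0.1 s per read slot

chunk_mb = stream_kbps * secs_between_seeks / 8 / 1000  # 1.5 MB per read
read_time = chunk_mb / xfer_mb_per_sec                  # 0.025 s

assert streams_per_drive == 400
assert chunk_mb == 1.5
assert read_time < budget  # 0.025 s, well under the 0.1 s budget
```

The math checks out under these assumptions; as the replies note, the interesting cases are interactive seeks and higher streaming:storage ratios.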


This sounds reasonable. AFAIK the default readahead on Linux is 128KB, so you'll probably need to test and tune quite a bit to get it right. I had a paragraph roughly along those lines in an early draft, but decided it was too much detail. "Concurrent reads" is probably more accurate than plain "seek time".


The kernel readahead is actually irrelevant. Your app issues 1MB reads one at a time, and the kernel will fetch each one all at once.
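A minimal sketch of that approach: the application itself reads in 1MB chunks, so each stream costs one large sequential read per time slice regardless of the kernel's readahead setting. The `posix_fadvise` hint is an extra assumption on my part (POSIX-only, and purely advisory), not something the parent comment requires.

```python
import os

CHUNK = 1024 * 1024  # 1 MB application-level reads, as described above

def stream_file(path):
    """Yield a file's contents in 1MB chunks."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Optional hint that access will be sequential (skipped off-POSIX).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, CHUNK)
            if not chunk:
                break
            yield chunk
    finally:
        os.close(fd)
```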


That's all fine and dandy till the users decide to pause, seek to a different spot in the stream or switch streams altogether rapidly.

When a user hits "next" to go to the next movie, it should start playing in less than a second or two. When a user seeks to a new chapter in the movie, the same thing.

Your math works if the users passively sit and consume a stream without much interaction.


That's the "put all 10TB on each streaming server" solution that he immediately rejected because of expense. You can see that the more streaming servers you connect to the storage system, the more important seek time becomes. He did then say "whatever architecture", which is not quite right since the seek bottleneck may not appear until the streaming:storage ratio gets higher than 1.

For a realistic exercise, throw a cost factor into your equation and optimize for cost and latency (aka user experience).


In the big picture, are the drives that expensive? Your monthly bandwidth bill is going to come close.

Measuring latency (vs seek time) is definitely a better metric. But is a 5 second (the horror!) delay to watch a 40 minute video that bad for user experience?

A better example may have been streaming music, where we already know last.fm got a big win from using SSDs.


The drive themselves might be an extra 10 or 20% of the cost of the machine. They come with other factors as well. 5x the disks means more power draw and cooling. It probably also means you need a 2U chassis instead of pizza-box 1U, which means your datacenter rent doubles. (It may be possible to fit 10TB into a 1U. I've never tried it, but I suspect you'll have space / power / heat problems.)
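To put rough numbers on that tradeoff, here is a toy monthly cost model. Every dollar and wattage figure below is an invented placeholder for illustration, not real datacenter pricing.

```python
# Toy cost model: rack rent scales with chassis height, power with
# drive count. All figures are made-up placeholders, not real prices.
def monthly_cost(chassis_u, drive_count,
                 rent_per_u=100.0,        # hypothetical $/U/month
                 watts_per_drive=8.0,     # hypothetical draw per spinning drive
                 power_cost_per_watt=0.25):
    rent = chassis_u * rent_per_u
    power = drive_count * watts_per_drive * power_cost_per_watt
    return rent + power

slim = monthly_cost(chassis_u=1, drive_count=2)    # pizza-box 1U, few drives
fat = monthly_cost(chassis_u=2, drive_count=10)    # 2U packed with disks

assert fat > slim  # the fat box costs more before it serves a single byte
```

Even with toy numbers, the shape of the parent's argument shows up: the drives themselves are a minor line item next to the rent and power they drag along.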


Reading this makes me really curious how Netflix's Instant Watching service is architected. Anyone have any details?


As far as I know, Netflix does not handle the actual streaming of bits - it is done by CDNs like Akamai or Limelight.


They have been moving a lot of their infrastructure over to EC2: http://www.zdnet.com/blog/btl/netflix-migrating-more-infrast...


Netflix is using Level3 as their CDN to stream movies - http://www.thestreet.com/story/10916915/level-3-surges-on-ne...


This is very interesting and well written, so even someone "dumb" like me can learn a lot about planning and optimisation from it. Right in the third paragraph I realised why some SQL query I wrote is so slow -- embarrassing but true, heh!


Are you just inexperienced? Don't prematurely sell yourself short. I'd read that a Rails hosting company found that 70% of their customers hadn't properly indexed their tables, and knowing's half the battle.


A much more readable, FB-free version of this is at his blog, http://carlos.bueno.org/


Thanks, facebook.com is blocked at work.


A company well-stocked with full-stack engineers who can communicate makes the execution of good ideas so frictionless and the death of bad ones so quick and painless that it cannot help but to succeed.

If this is the caliber of engineering talent that Facebook values, there should be no surprise they are taking over the world.


From the ex-Yahoo who got 'famous' by using a microwave near his wifi laptop to simulate dropped packets.



