
The design document is a good read and includes high-level details of how they're grabbing packets (AF_PACKET), the packet index format (LevelDB), and the defensive measures they took (fuzz testing via AFL, setcap, seccomp):

https://github.com/google/stenographer/blob/master/DESIGN.md


Hey, thanks! If you have any additional questions about the design process, internals, etc., feel free to ask. I'm the primary author of the project, and I'll be refreshing the HN post for the next hour or so, trying to answer questions as they come up and/or updating the docs.


What kind of performance do you see when searching over, say, a day or two of captured data at 10TB/day? It seems like the query would have to open a file for every minute? Have you considered a higher-level index, to tell which minute files are worth inspecting? (I realize this only helps when searching for more unique characteristics.)

Is LevelDB the best choice out there for write-once KV pairs? For, say, IP address indexing, what's the final bits-per-packet overhead of the index?

I didn't see any compression for the packet data. Did you consider high perf compression like LZ4?

Is AF_PACKET better than PF_RING+DNA? It's been a while since I looked but with hardware accel they boasted massive perf advantages.

Excellent design docs and cool work!


Hey, great questions!

Query Performance: Right now, we've got test machines deployed with 8x 500GB disks for packets + 1 indexing disk (all 15K RPM spinning disks). We keep them at 90% full, or roughly 460GB/disk, about 1K files/disk. Querying over the entire corpus (~4TB of packets) for something innocuous like 'port 65432' takes 25 seconds to return ~50K packets (that's after dropping all disk caches). The same query run again takes 1.5 seconds, with disk caches in place. Of course, the number of packets returned is a huge factor in this... each packet requires a seek in the packets file. Searching for something that doesn't exist (host 0.0.0.1) takes roughly 5 seconds. Note that time-based queries, like "port 4444 and after 3h ago and before 1h ago", do choose to only query certain files, taking advantage of the fact that we name files by microsecond timestamp and we flush files every minute.
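For the curious, that time-pruning step boils down to something like the sketch below. This is my own illustration, not steno's actual query code: the directory path is a placeholder, and the "one file covers roughly a minute" assumption comes straight from the flush behavior described above.

    package main

    import (
        "fmt"
        "path/filepath"
        "strconv"
        "time"
    )

    // filesInWindow returns the blockfiles whose contents could overlap the
    // window [start, end), assuming each file is named by the microsecond
    // timestamp at which it was opened and holds roughly one minute of traffic.
    func filesInWindow(dir string, start, end time.Time) ([]string, error) {
        paths, err := filepath.Glob(filepath.Join(dir, "*"))
        if err != nil {
            return nil, err
        }
        var out []string
        for _, p := range paths {
            us, err := strconv.ParseInt(filepath.Base(p), 10, 64)
            if err != nil {
                continue // not a timestamp-named blockfile
            }
            opened := time.Unix(0, us*int64(time.Microsecond))
            // A file opened at T holds packets from roughly [T, T+1 minute).
            if opened.Add(time.Minute).After(start) && opened.Before(end) {
                out = append(out, p)
            }
        }
        return out, nil
    }

    func main() {
        // e.g. "after 3h ago and before 1h ago"
        files, err := filesInWindow("/path/to/packets",
            time.Now().Add(-3*time.Hour), time.Now().Add(-1*time.Hour))
        if err != nil {
            panic(err)
        }
        fmt.Println(files)
    }

Everything outside that window never gets opened at all, which is why adding a time range to a query is the cheapest way to speed it up.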

A big part of query performance is actually over-provisioning disks. We see disk throughput of roughly 160-180MB/s. If we write 160MB/s, our read throughput is awful. If we write 100MB/s, it's pretty good. Who would have thought: disks have limited bandwidth, and it's shared between reads and writes. :)

We actually don't use LevelDB... we use the SSTables that underlie LevelDB. Since we know we're write-once, we use https://github.com/google/leveldb/blob/master/include/leveld... directly for writes (and its Go equivalent for reads). I'm familiar with the file format (they're used extensively inside Google), so it was a simple solution. That said, it's been very successful... we tend to have indexes in the 10s of MBs for 2-4GB files. Of course, index size/compressibility is directly correlated with network traffic: more varied IPs/ports would be harder to compress. The built-in compression of LevelDB tables is also a boon here... we get prefix compression on keys, plus snappy compression on packet seek locations, for free.
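To make the index shape concrete, here's a toy version of the key/value layout as described above: keys are a type byte plus a value (an IPv4 address here), values are the byte offsets of matching packets in the blockfile. The type byte and exact encoding below are made up for illustration (the real format lives in steno's index code), but they show why sorted keys prefix-compress and why the offset lists compress well inside SSTable blocks.

    package main

    import (
        "encoding/binary"
        "fmt"
        "net"
        "sort"
    )

    // Hypothetical tag byte marking "IPv4 address" index entries.
    const typeIPv4 = 0x04

    // ipKey builds a toy index key: one type byte followed by the 4 address
    // bytes, so keys sort by type and then numerically by IP.
    func ipKey(ip net.IP) string {
        return string(append([]byte{typeIPv4}, ip.To4()...))
    }

    // encodeOffsets packs packet offsets (byte positions in the blockfile) as
    // varints; in the real index the SSTable block format then adds snappy
    // compression on top.
    func encodeOffsets(offsets []uint64) []byte {
        var buf []byte
        tmp := make([]byte, binary.MaxVarintLen64)
        for _, o := range offsets {
            n := binary.PutUvarint(tmp, o)
            buf = append(buf, tmp[:n]...)
        }
        return buf
    }

    func main() {
        index := map[string][]uint64{
            ipKey(net.ParseIP("10.0.0.1")): {4096, 131072, 262144},
            ipKey(net.ParseIP("10.0.0.2")): {8192, 65536},
        }
        // SSTables require keys to be written in sorted order; addresses in
        // the same subnet then share long key prefixes, which is exactly what
        // the format's prefix compression exploits.
        keys := make([]string, 0, len(index))
        for k := range index {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        for _, k := range keys {
            fmt.Printf("key % x -> %d bytes of encoded offsets\n", k, len(encodeOffsets(index[k])))
        }
    }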

We currently do no compression of packets. Doing so would definitely increase our CPU usage per packet, and I'm really scared of what it would do to reads. Consider that reading packets in compressed storage would require decompressing each block a packet is in. On the other hand, if someone wanted to store packets REALLY long term, they could easily compress the entire blockfile+index before uploading to more permanent storage. I expect this would be better than having to do it inline. Even if we did build it in, we'd probably do it tiered (initial write uncompressed, then compress later as resources allow).
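The offline/archival version really is trivial, something along these lines (paths are placeholders, and gzip is just a stand-in for whatever compressor the long-term store prefers):

    package main

    import (
        "compress/gzip"
        "io"
        "os"
    )

    // compressFile gzips src into src+".gz" in one streaming pass. Run offline,
    // after steno is done with the file, it never touches the hot read/write path.
    func compressFile(src string) error {
        in, err := os.Open(src)
        if err != nil {
            return err
        }
        defer in.Close()

        out, err := os.Create(src + ".gz")
        if err != nil {
            return err
        }
        zw, _ := gzip.NewWriterLevel(out, gzip.BestSpeed) // BestSpeed is always a valid level
        if _, err := io.Copy(zw, in); err != nil {
            out.Close()
            return err
        }
        if err := zw.Close(); err != nil {
            out.Close()
            return err
        }
        return out.Close()
    }

    func main() {
        if err := compressFile("/path/to/blockfile"); err != nil {
            panic(err)
        }
    }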

AF_PACKET is no better than PF_RING+DNA, but I also don't think it's any worse. They both have very specific trade-offs. The big draw of AF_PACKET for me is that it's already there... any stock Linux machine will already have it built in and working. Thus steno should "just work", while a PF_RING solution has a slightly higher barrier to entry. I think PF_RING+DNA should give similar performance to steno... but libzero currently probably gives better performance because packets can be shared across processes. This is a really interesting problem that I'm wondering if we could also solve with AF_PACKET... but that's a story for another day. Short story: I wanted this to work on stock Linux as much as possible.
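To illustrate the "already there" point: opening the socket family takes a handful of lines on any Linux box. Steno's real capture path uses TPACKET_V3 memory-mapped rings in C++; the plain recvfrom loop below is only a minimal (and much slower) demonstration, and it needs root or CAP_NET_RAW.

    //go:build linux

    package main

    import (
        "fmt"
        "syscall"
    )

    // htons converts a short to network byte order, as AF_PACKET expects.
    func htons(v uint16) uint16 { return v<<8 | v>>8 }

    func main() {
        // AF_PACKET ships with every stock Linux kernel: no out-of-tree
        // drivers or special NICs needed, which is the "just works" property
        // described above.
        fd, err := syscall.Socket(syscall.AF_PACKET, syscall.SOCK_RAW, int(htons(syscall.ETH_P_ALL)))
        if err != nil {
            panic(err) // needs root or CAP_NET_RAW (cf. the setcap notes in DESIGN.md)
        }
        defer syscall.Close(fd)

        buf := make([]byte, 65536)
        for i := 0; i < 5; i++ {
            n, _, err := syscall.Recvfrom(fd, buf, 0)
            if err != nil {
                panic(err)
            }
            fmt.Printf("captured a %d-byte frame\n", n)
        }
    }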


Thanks for the detailed reply! But I'm really curious: that performance is a bit beyond the spec'd maximum for such HDDs (3.0ms seek + 2ms rotational latency is about 5ms per random read, so 50K random IOs across 8 disks should need around 50,000 × 5ms / 8 ≈ 31 seconds). I'm guessing a bit of clustering in the packet distribution improves seek time, so a sector/page contains multiple hits?

I'm interested because I wrote an app-specific indexer, but one requiring "interactive" query response times over a couple of TB, for multiple users. But that was years ago, before LevelDB and Snappy (and Kyoto Cabinet had far too much overhead per KV pair), and on small CPUs and a single 7200rpm disk. I got compression rates of 5 to 6 using QuickLZ; a non-trivial gain.

I was looking at this problem space again and considering a delta+int compression approach to offsets, given they're just incremental. (And there are cool SIMD algorithms for 'em.) But it sounds like SSTable + fscache is fast enough, wow, that's pretty cool!

The decompression of blocks in some apps doesn't have to be much of a penalty if there's a reasonable amount of clustering going on in the sample set. What I did was, instead of just splitting blocks on time, segment them based on flow and time. I did L7 inspection, and an old quad-core Core2 could handle 1Gbps, so 10Gbps is probably achievable nowadays, certainly for L4 flows. That way there's great locality for most queries.

Further, the real cost is the seek, and transferring a few more sectors won't cost as much. If you're using mmap'd IO for reading, you might be able to compress pages and not pay any IO penalty, right? And in fact, it might even reduce the number of seeks, due to increased clustering of packets onto the same page. And I think some of the fastest compression algorithms only look back a very small amount, like 16K or 64K, anyway? Although this is probably more easily done just by using a compressed filesystem, because the cache management code is probably nontrivial.


I think the reason we're getting faster performance is that we tend to have packets clustered on disk, as you've surmised. Since packets with particular ports/IPs/etc tend to cluster in time, there's a good chance that at least a few will take advantage of disk caches. Even if we clear the disk cache before the query, the first packet read can cache some read-ahead, and a subsequent packet read may hit that cache entry without requiring an additional seek/read.

As far as compressing offsets goes, I haven't done any specific measurements, but my intuition is that snappy (really, any compression algorithm) gives us a huge benefit, since all offsets are stored in order: they tend to have at least 2 prefix bytes in common, so they're highly compressible.
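That intuition is easy to sanity-check in isolation with something like the sketch below, which feeds synthetic, monotonically increasing offsets straight through snappy (in steno the compression actually happens inside the SSTable block format; github.com/golang/snappy is just a convenient binding here, and the ratio you get depends entirely on the traffic).

    package main

    import (
        "encoding/binary"
        "fmt"
        "math/rand"

        "github.com/golang/snappy"
    )

    func main() {
        // Synthetic but monotonically increasing packet offsets, like the
        // ones the index stores: nearby packets land at nearby file
        // positions, so fixed-width big-endian encodings share leading bytes.
        offsets := make([]uint64, 50000)
        var pos uint64
        for i := range offsets {
            pos += uint64(60 + rand.Intn(1500)) // roughly packet-sized gaps
            offsets[i] = pos
        }

        raw := make([]byte, 8*len(offsets))
        for i, o := range offsets {
            binary.BigEndian.PutUint64(raw[8*i:], o)
        }

        compressed := snappy.Encode(nil, raw)
        fmt.Printf("raw: %d bytes, snappy: %d bytes (%.1fx)\n",
            len(raw), len(compressed), float64(len(raw))/float64(len(compressed)))
    }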

I experimented with mmap'ing all files in stenographer when it sees them, and it turned out to have negligible performance benefits... I think because the kernel already does disk caching in the background.

I think compression is something we'll defer until we have an explicit need. It sounds super useful, but we tend not to really care about data after a pretty short time anyway... we try to extract "interesting" pcaps from steno pretty quickly (based on alerts, etc). It's a great idea, though, and I'm happy to accept pull requests ;)

Overall, I've been really pleased with how doing the simplest thing actually gives us good performance while maintaining understandability. The kernel disk caching means we don't need any in-process caching. The simplest offset encoding + built-in compression gives great compression and speed. O_DIRECT gives really good disk throughput by offloading all write decisions to the kernel. More often than not, more clever code gave little or even negative performance gains.


Yeah it's very impressive how fast general systems have become, eliminating a lot of the need for clever hacks.

I wonder how much would change if you were to use a remote store for recording packets, like S3 or other blob storage. In such cases the transfer-time overhead _might_ make the compression tradeoff different. And the whole seek-to-offset approach might need a chunking system anyway (although I guess you can just use a Range request when fetching a blob, but the overhead is much larger than a disk seek).


> The design document is a good read ...

and large chunks of the code are in Go :), with only the performance-related stuff (read: packet capture) being done in C++. Pretty cool.


I founded 4chan eleven and a half years ago at the age of 15, and after more than a decade of service, I've decided it's time for me to move on.

4chan has faced numerous challenges over the years, including how to continuously satisfy a community of millions, and ensure the site has the human, technical, and financial resources to continue operating. But the biggest hurdle it's had to overcome is myself. As 4chan's sole administrator, decision maker, and keeper of most of its institutional knowledge, I've come to represent an uncomfortably large single point of failure.

I've spent the past two years working behind the scenes to address these challenges, and to provide 4chan with the foundation it needs to survive me by bolstering its finances, strengthening its infrastructure, and expanding and empowering its team of volunteers. And for the most part, I've succeeded. The site isn't in danger of going under financially any time soon, and it's as fast and stable as ever thanks to continued development and recent server upgrades. Team 4chan is also at its largest, and while I've still been calling the shots, I've delegated many of my responsibilities to a handful of trusted volunteers, most of whom have served the site for years.

That foundation will now be put to the ultimate test, as today I'm retiring as 4chan's administrator. From a user's perspective, nothing should change. A few senior volunteers—including 4chan's lead developer, managing moderator, and server administrator—have stepped up to ensure a smooth transition over the coming weeks.

I'll need time away to decompress and reflect, but I look forward to one day returning to 4chan as its Admin Emeritus or just another Anonymous, and also writing more about my experience running 4chan on my personal blog. The journey has been marked by highs and lows, surprises and disappointments, but ultimately immense satisfaction. I'm humbled to have had the privilege of both founding and presiding over what is easily one of the greatest communities to ever grace the Web. It was truly an honor to serve as 4chan's founding administrator, and I look forward to seeing what the next decade holds for the site.

On to the next chapter,

–moot

mootnote: I plan to dedicate this Friday afternoon (ET) to hosting a livestreamed Q&A session, where I'll hold court with the community one last time. Be sure to subscribe to our YouTube page, and keep an eye out for a global message the day of. As always, I welcome and read all of my e-mails, and you may contact me personally at moot@4chan.org.

Eleven and a half years, in numbers:

42,176,061,890 Total pageviews
1,771,091,423 Total posts
1,071,189,182 Total visitors
620,125,147 Monthly pageviews
21,128,887 Daily pageviews
20,360,487 Monthly visitors
1,223,807 Daily visitors
2,838 Terabytes per month
105 Volunteers
63 Boards
1 Administrator


See also (just don't do any banking, and disable JavaScript):

bestproxyunblock.com proxify.com unblockingproxy.net freeopenproxy.com freeproxywebsite.net proxyserver.com proxysite.com newipnow.com browserproxy.net proxfree.com


Is there an online archive that I can view before signing up?


I just released this, so unfortunately I don't have an archive (yet).

However, you can view a fully filled example here: http://websecweekly.org/static/newsletters/example.html

Thanks!


Great resource, many thanks for releasing this. May I suggest showing this example directly on the homepage?


Great idea! Done! :)


This looks great, thanks :)


Thanks for the shoutout! Twitpic-Backup is pretty old but will definitely get your images. Props to pcgMongo for today's PR:

https://github.com/Stantheman/Twitpic-Backup/pull/2


I have a mirror of this site. wooledge.org runs off of greycat's (#bash on freenode) home DSL connection:

http://bash.cumulonim.biz/BashPitfalls.html


Seriously? They can't find a better place to host a site like this here in 2013?


If you already have a machine serving stuff at home, adding a remote server just means more time, money, and effort spent. Is it worth it? Not all sites are high-volume, and even well-prepared sites on decent lines can go down if they hit the HN front page.

People do occasionally bitch and moan about slow downloads from my home host, but nobody ever offered money for a pain-free alternative.


Google doc, gist, tumblr, any free blog engine.


None of which are pain-free to me.


Dropbox? Unless you're pushing gigabytes per day, it should make a perfect static host, with URLs no uglier than the previously mentioned services.


Thanks. I don't understand why people submit links with content that can't be accessed by more than a few users simultaneously.


How would one know this in advance?


How would one know in advance that submitting to HN would result in traffic consisting of more than a handful of users?


How would one know in advance whether a site can handle high traffic?

And did I really have to spell this out?


Well, it is the main reference, but it won't sustain the "HN effect", so thanks for the mirror link. I typically use http://wiki.bash-hackers.org/doku.php which also provides a mirror and some extras.


Have you tried a recent version? This commit should have made things much better:

https://github.com/linode/longview/commit/dc48b6ddce04dc7155...


Do you remember any? I'd love to hear some!


I do! Although I'm not totally sure Professor Kernighan intended for them to leave the classroom. Occasionally he'll throw an email from a former associate into the lecture slides (e.g., James Gosling), but I've noticed that he's removed them from the public-facing slides on the site, which makes me think I should be somewhat cautious. Maybe eventually, with his permission, I'll write some up and post them on HN.


With regards to the real-time analysis, you might find perl's Regexp::Debugger fun to use:

http://search.cpan.org/~dconway/Regexp-Debugger-0.001011/lib...

That's the documentation for using the module in your code. If you're interested in the standalone tool, you'll want rxrx:

http://search.cpan.org/~dconway/Regexp-Debugger-0.001011/bin...


Off-topic, but you searched AltaVista? Did you choose to use it specifically or is it your daily driver?


AltaVista is just a re-skinned version of Yahoo Search (which itself is driven by MSFT, but has some modifications vs Bing). So this isn't as odd a choice as it might first seem.


Rob Pike actually wrote something similar in his short, self-proclaimed polemic, "Systems Software Research is Irrelevant":

http://herpolhode.com/rob/utah2000.pdf

"Only one GUI has ever been seriously tried, and its best ideas date from the 1970s. Surely there are other possibilities. (Linux’s interface isn’t even as good as Windows!) There has been much talk about component architectures but only one true success: Unix pipes. It should be possible to build interactive and distributed applications from piece parts. The future is distributed computation, but the language community has done very little to address that possibility."

"The world has decided how it wants computers to be. The systems software research community influenced that decision somewhat, but very little, and now it is shut out of the discussion..."

And particularly relevant to the HN community at large: "Be courageous. Try different things; experiment. Try to give a cool demo."

