Saving $30k a month by improving garbage collection (mixpanel.com)
56 points by i0exception on Aug 11, 2021 | 6 comments



> We wrote a one-off tool to go through our GCS buckets and delete all the files that didn’t have an entry in the manifest.

...

> Specifically, we had to ensure two things — every file removed from the manifest has an entry in BadgerDB, and no file locator that resides in BadgerDB is present in the manifest.

This feels a little weird to read, because I've been working with things which are moving in the exact opposite direction right now.

So we had Hadoop S3Guard[1], which kept track of deleted files in a bucket so that we could detect a delete + read race before S3 had strong consistency. And that's stored in DynamoDB, which is very much a KV store, and it was somewhat of a nightmare to keep track of these things.

Now we're moving onto Apache Iceberg, which has roughly the same design discussed here (file-based manifests + orphaned files for failed commits). We're going there because storing data in a standalone metadata service is getting a bit old-tech (Hive ACIDv2 keeps this info as number sequences, which is very Postgres-like, but needs an FS listing to start off).

So bit by bit, we're moving from a KV store to file manifests to make systems more scalable. In that context, a problem like this very clearly exists in my future, and I wonder if there's a better way to prevent it than going back to a KV-store model again (particularly when the manifests can fork into a tree thanks to data-sharing with snapshots).

[1] - https://www.slideshare.net/hortonworks/s3guard-whats-in-your...
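
To make that concrete: the cleanup the article describes is essentially "union every live manifest, list the bucket, delete the difference". A rough sketch of that sweep in Go against GCS - the manifest-loading helper and bucket name are hypothetical placeholders, not the article's actual tool:

    // Rough sketch: delete every object that no live manifest references.
    // Helper names and bucket are placeholders, not the article's tool.
    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/storage"
        "google.golang.org/api/iterator"
    )

    // liveLocators stands in for parsing every manifest reachable from the
    // snapshot tree and collecting the union of referenced object names.
    func liveLocators() map[string]bool {
        return map[string]bool{}
    }

    func main() {
        ctx := context.Background()
        client, err := storage.NewClient(ctx)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        live := liveLocators()
        bucket := client.Bucket("example-event-data") // placeholder bucket

        it := bucket.Objects(ctx, nil)
        for {
            attrs, err := it.Next()
            if err == iterator.Done {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            if live[attrs.Name] {
                continue // still referenced by some manifest; keep it
            }
            // Orphan: no manifest references it, so it is safe to delete.
            if err := bucket.Object(attrs.Name).Delete(ctx); err != nil {
                log.Printf("delete %s: %v", attrs.Name, err)
            }
        }
    }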


Interesting.

> And that's stored in DynamoDB, which is very much a KV store, and it was somewhat of a nightmare to keep track of these things.

What problems did you face keeping track of these things in a KV store? The good thing about BadgerDB in our case is that the BadgerDB data files live on the same filesystem as our file-based manifests, so in essence it's just a fancy file-based manifest that supports efficient features like append-only operations, partial reads, crash recovery, etc. In theory we could implement all such features on our existing manifest files; with BadgerDB we just get them out of the box, and snapshots still work as expected. So we are trying to move all the information in the manifest files to a separate BadgerDB instance for each shard. But while the source of truth lives in the manifest files as well as BadgerDB (i.e., while we are in transition), we have to do the dance of synchronizing data between the two.
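
For what it's worth, a minimal sketch of what one side of that synchronization check could look like during the transition, using BadgerDB's Go API - the manifest loader is a hypothetical stand-in, and this is not Mixpanel's actual code:

    // Sketch of one of the two invariants from the post: no locator queued
    // for deletion in BadgerDB may still be referenced by the manifest.
    // (The other direction - every locator removed from the manifest has a
    // BadgerDB entry - needs a diff of consecutive manifest versions.)
    package main

    import (
        "fmt"
        "log"

        badger "github.com/dgraph-io/badger/v3"
    )

    // manifestLocators is a hypothetical stand-in for reading a shard's
    // file-based manifest into a set of live file locators.
    func manifestLocators(shard string) (map[string]bool, error) {
        return map[string]bool{}, nil
    }

    func checkSync(db *badger.DB, shard string) error {
        live, err := manifestLocators(shard)
        if err != nil {
            return err
        }
        return db.View(func(txn *badger.Txn) error {
            it := txn.NewIterator(badger.DefaultIteratorOptions)
            defer it.Close()
            for it.Rewind(); it.Valid(); it.Next() {
                key := string(it.Item().Key())
                if live[key] {
                    return fmt.Errorf("locator %q queued for GC but still in manifest", key)
                }
            }
            return nil
        })
    }

    func main() {
        db, err := badger.Open(badger.DefaultOptions("/data/shard1/badger")) // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        if err := checkSync(db, "shard1"); err != nil {
            log.Fatal(err)
        }
    }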


> The good thing about BadgerDB in our case is that the BadgerDB data files live on the same filesystem as our file-based manifests

I think that's pretty much the difference - once you start storing this in a distributed system (like DynamoDB), you start getting weird timing issues around how to serialize operations that could potentially step on each other. Consistency in distributed systems is usually a headache when you let users pick what key they want to create (basically anyone could create a path named "/tmp/foo.txt" at the same time from anywhere in the platform).

From my point of view, you have built a good WAL implementation using BadgerDB instead of doing the thankless & distracting work of separating fsync from fdatasync.

> So we are trying to move all the information in the manifest files to a separate BadgerDB instance for each shard

And as you mentioned, because you don't want to put a WAL in a generic mutable store (rather than in append-only logs), that makes sense. Rather than building an append-only store and switching to it, it makes sense to move the whole thing into BadgerDB.

I haven't touched this area in a decade, but when I had to work with zBase at Zynga (my org built it, but I paid attention to the serialization + compression, not the storage persistence), it had its metadata store as SQLite files (the "file manifests" would be SQLite files), which was ridiculously simple because the writers were a single machine + single mutex (pinned to a core to reduce jitter) recording metadata. This was also built in the era of spinning rust, and specifically for EBS (so there was a lot of group-commit coalescing of requests, which naturally flowed into BEGIN DEFERRED).
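
For illustration, a minimal sketch of that single-mutex, group-commit shape in Go with SQLite - table and helper names are made up, and note that a plain BEGIN in SQLite is already BEGIN DEFERRED:

    // Sketch of a single-writer metadata store: one mutex serializes all
    // writers, and each batch lands in one deferred transaction so the disk
    // sees a single commit per group instead of one per record.
    // Names are illustrative; this is not zBase's actual code.
    package main

    import (
        "database/sql"
        "log"
        "sync"

        _ "github.com/mattn/go-sqlite3"
    )

    type metaWriter struct {
        mu sync.Mutex // the single mutex guarding all metadata writes
        db *sql.DB
    }

    // recordBatch coalesces many metadata updates into one transaction.
    func (w *metaWriter) recordBatch(entries [][2]string) error {
        w.mu.Lock()
        defer w.mu.Unlock()

        tx, err := w.db.Begin() // plain BEGIN == BEGIN DEFERRED in SQLite
        if err != nil {
            return err
        }
        for _, e := range entries {
            if _, err := tx.Exec(
                `INSERT OR REPLACE INTO manifest(path, locator) VALUES (?, ?)`,
                e[0], e[1],
            ); err != nil {
                tx.Rollback()
                return err
            }
        }
        return tx.Commit() // one commit (and one sync) for the whole group
    }

    func main() {
        db, err := sql.Open("sqlite3", "manifest.db")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(
            `CREATE TABLE IF NOT EXISTS manifest(path TEXT PRIMARY KEY, locator TEXT)`,
        ); err != nil {
            log.Fatal(err)
        }
        w := &metaWriter{db: db}
        if err := w.recordBatch([][2]string{{"shard1/file1", "gs://bucket/obj"}}); err != nil {
            log.Fatal(err)
        }
    }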

It was a joy to debug in a lot of situations, where I was happy there wasn't some custom protobuf parser needed to join + merge + update the data on disk while the system was turned off.


Midway through, I realized they're reinventing already-solved problems.

Mixpanel should take a look at Apache Iceberg for the write path, and at Apache Pulsar to keep costs lower by not needing to keep 7-day retention in the pipeline once messages are ack'ed by all consumers.

For replaying, you can use Trino to just read from your Iceberg tables and insert back into your stream.


It's a lot harder to make such drastic changes to something that's already serving production traffic at scale. It's okay to reinvent solved problems sometimes, because the alternative would be a multi-year project with unclear returns.

Also, the 7-day retention is a feature - it helps us recover data quickly in case of bugs in our storage code.


I had written a novel of counterarguments and realized anyone who tries to tell someone that their bug is a feature isn't going to see the bug for what it is.

Best of luck, and God help the poor souls who have to maintain this when the original authors have left.



