It's interesting that they've found denormalizing their log data so useful. I'm surprised to hear that it performs better for practical queries than a database with appropriate indexes, and that they've been able to build more ergonomic interfaces for querying it than the standard relational approach a lot of people already have experience with. But I don't know much about log management at scale, so I'm only mildly surprised.
Denormalization typically improves performance. Normalization isn't done for performance reasons but for consistency reasons, i.e. so that data isn't duplicated and there's only a single source of truth.
Yes, exactly — normalization is really useful for reasons of quality and correctness, but generally not so important for data like logs that's rotating through the system on a pretty constant basis.
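To make the trade-off concrete, here's a rough sketch (the tables and fields are invented for illustration, not anything from the article): with a normalized schema, a per-request question needs a join back to the users table, while a denormalized "canonical" log row carries the user attributes along with it and can be answered with a plain scan.

```python
# Rough sketch of normalized vs. denormalized log storage (all names hypothetical).
import sqlite3

db = sqlite3.connect(":memory:")

# Normalized: user attributes live in exactly one place, so querying request
# logs by plan requires a join back to the users table.
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT);
    CREATE TABLE request_logs (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        path TEXT, status INTEGER, duration_ms REAL
    );
""")
rows = db.execute("""
    SELECT l.path, l.status
    FROM request_logs l
    JOIN users u ON u.id = l.user_id
    WHERE u.plan = 'enterprise'
""").fetchall()

# Denormalized "canonical log line": the user attributes are copied onto every
# event, so the same question is a single scan over one table with no join.
db.execute("""
    CREATE TABLE canonical_logs (
        id INTEGER PRIMARY KEY,
        user_id INTEGER, user_email TEXT, user_plan TEXT,
        path TEXT, status INTEGER, duration_ms REAL
    )
""")
rows = db.execute(
    "SELECT path, status FROM canonical_logs WHERE user_plan = 'enterprise'"
).fetchall()
```

The obvious cost is duplication: if a user changes plan, old log rows keep whatever plan they were emitted with. For logs that's usually fine, but it's exactly the inconsistency normalization exists to prevent in a system of record.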
And addressing the parent's point on databases: they don't look like an RDBMS, but you can kind of think of log management/querying systems like Splunk et al. as specialized databases with specific properties:
- Flexible indexing: Logs change frequently, which means keys come and go, so it's convenient not to have to constantly build new indexes to keep them searchable.
- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so these systems generally rotate data through different tiers of storage as it ages (see the sketch after this list): the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.
- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained; it scales horizontally much more easily because it doesn't have to concern itself with things like consensus around write consistency.)
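To illustrate the tiering idea in the second bullet, here's a toy age-based router. The tier names and thresholds are made up; real systems like Splunk handle this with their own bucket lifecycle machinery rather than anything this simple.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Invented thresholds: "hot" on fast local disks, "warm" on big slow disks,
# everything older archived to object storage.
TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=90), "warm"),
]

def storage_tier(event_time: datetime, now: Optional[datetime] = None) -> str:
    """Pick a storage tier for a log event based purely on its age."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "cold"  # e.g. shipped off to S3 or similar

print(storage_tier(datetime.now(timezone.utc) - timedelta(days=2)))    # hot
print(storage_tier(datetime.now(timezone.utc) - timedelta(days=400)))  # cold
```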
How many columns does the average canonical log entry at Stripe have? What's the mix of low/high cardinality string fields look like vs number of metric/counter fields?
Logs can be treated as database rows regardless of source format (plaintext, CSV, JSON, etc.). The modern approach to dealing with tables at this scale is column-oriented storage and databases, which can easily handle billions of log lines without indexes by using ordering, partition maps, compression, etc.
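A small sketch of that idea using Parquet via pyarrow (file name and fields invented): each column is stored contiguously and compressed, and per-row-group min/max statistics let a scan skip whole chunks without any explicit index.

```python
# Minimal column-store sketch with Parquet (layout and field names invented).
import pyarrow as pa
import pyarrow.parquet as pq

# A batch of parsed log lines, held column by column rather than row by row.
logs = pa.table({
    "ts":          [1700000000, 1700000001, 1700000002],
    "status":      [200, 500, 404],
    "duration_ms": [12.5, 830.1, 4.2],
    "path":        ["/v1/charges", "/v1/charges", "/healthz"],
})

# Parquet stores each column contiguously and compressed, with min/max stats
# per row group, so readers can prune chunks without a separate index.
pq.write_table(logs, "logs.parquet")

# Read back only the columns the query touches, filtering on status.
errors = pq.read_table(
    "logs.parquet",
    columns=["ts", "path", "status", "duration_ms"],
    filters=[("status", ">=", 500)],
)
print(errors.to_pylist())
```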
It's also about DRY and accurately modelling the data for general operations. Imagine the difference between a data environment where a GDPR deletion request comes in and everything has a relation back to the customer identity, and one where the customer identity is denormalised out to many places or only implicitly present.
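A toy example of that difference (schemas invented for illustration): in the normalized layout a deletion request is one statement plus cascades, while in the denormalized layout you have to know every table the identity was copied into and clean each one up separately.

```python
# Hypothetical schemas to illustrate the GDPR-deletion point.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")

# Normalized: every table that touches a person points back at customers.id,
# so honouring a deletion request is one DELETE and the cascades do the rest.
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE,
        total_cents INTEGER
    );
    CREATE TABLE support_tickets (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE,
        body TEXT
    );
""")
db.execute("DELETE FROM customers WHERE email = ?", ("person@example.com",))

# Denormalized: the identity has been copied into many tables under different
# column names, so you have to track down every place it landed.
db.executescript("""
    CREATE TABLE orders_flat   (id INTEGER PRIMARY KEY, customer_email TEXT, total_cents INTEGER);
    CREATE TABLE tickets_flat  (id INTEGER PRIMARY KEY, requester_email TEXT, body TEXT);
    CREATE TABLE emails_sent   (id INTEGER PRIMARY KEY, to_address TEXT, subject TEXT);
""")
for table, column in [("orders_flat", "customer_email"),
                      ("tickets_flat", "requester_email"),
                      ("emails_sent", "to_address")]:
    db.execute(f"DELETE FROM {table} WHERE {column} = ?", ("person@example.com",))
```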