It's interesting that they've found denormalizing their log data so useful. I'm surprised to hear that it performs better for practical queries than a database with appropriate indexes, and that they've been able to build more ergonomic interfaces for querying it than the standard relational approach a lot of people already have experience with. But I don't know much about log management at scale, so I'm only mildly surprised.
Denormalization typically improves performance. Normalization isn't done for performance reasons but for consistency reasons, i.e. so that data isn't duplicated and there's only a single source of truth.
Yes, exactly — normalization is really useful for reasons of quality and correctness, but generally not so important for data like logs that's rotating through the system on a pretty constant basis.
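To make the trade-off concrete, here's a rough sketch (the tables and fields are invented for illustration, not anything from the article): with a normalized schema, a per-request question needs a join back to the users table, while a denormalized "canonical" log row carries the user attributes along with it and can be answered with a plain scan.

```python
# Rough sketch of normalized vs. denormalized log storage (all names hypothetical).
import sqlite3

db = sqlite3.connect(":memory:")

# Normalized: user attributes live in exactly one place, so querying request
# logs by plan requires a join back to the users table.
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT);
    CREATE TABLE request_logs (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        path TEXT, status INTEGER, duration_ms REAL
    );
""")
rows = db.execute("""
    SELECT l.path, l.status
    FROM request_logs l
    JOIN users u ON u.id = l.user_id
    WHERE u.plan = 'enterprise'
""").fetchall()

# Denormalized "canonical log line": the user attributes are copied onto every
# event, so the same question is a single scan over one table with no join.
db.execute("""
    CREATE TABLE canonical_logs (
        id INTEGER PRIMARY KEY,
        user_id INTEGER, user_email TEXT, user_plan TEXT,
        path TEXT, status INTEGER, duration_ms REAL
    )
""")
rows = db.execute(
    "SELECT path, status FROM canonical_logs WHERE user_plan = 'enterprise'"
).fetchall()
```

The obvious cost is duplication: if a user changes plan, old log rows keep whatever plan they were emitted with. For logs that's usually fine, but it's exactly the inconsistency normalization exists to prevent in a system of record.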
And addressing the parent's point on databases: they don't look like an RDBMS, but you can kind of think of log management/querying systems like Splunk et al. as specialized databases with specific properties:
- Flexible indexing: Logs change frequently, which means keys come and go, so it's convenient not to have to constantly build new indexes to keep them searchable.
- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so these systems generally rotate data through different tiers of storage as it ages (see the sketch after this list): the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.
- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained; it scales horizontally much more easily because it doesn't have to concern itself with things like consensus around write consistency.)
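To illustrate the tiering idea in the second bullet, here's a toy age-based router. The tier names and thresholds are made up; real systems like Splunk handle this with their own bucket lifecycle machinery rather than anything this simple.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Invented thresholds: "hot" on fast local disks, "warm" on big slow disks,
# everything older archived to object storage.
TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=90), "warm"),
]

def storage_tier(event_time: datetime, now: Optional[datetime] = None) -> str:
    """Pick a storage tier for a log event based purely on its age."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "cold"  # e.g. shipped off to S3 or similar

print(storage_tier(datetime.now(timezone.utc) - timedelta(days=2)))    # hot
print(storage_tier(datetime.now(timezone.utc) - timedelta(days=400)))  # cold
```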
How many columns does the average canonical log entry at Stripe have? What's the mix of low/high cardinality string fields look like vs number of metric/counter fields?
Logs can be treated as database rows regardless of source format (plaintext, CSV, JSON, etc.). The modern approach to dealing with tables at this scale is column-oriented storage and databases, which can easily handle billions of log lines without indexes by using ordering, partition maps, compression, etc.
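A small sketch of that idea using Parquet via pyarrow (file name and fields invented): each column is stored contiguously and compressed, and per-row-group min/max statistics let a scan skip whole chunks without any explicit index.

```python
# Minimal column-store sketch with Parquet (layout and field names invented).
import pyarrow as pa
import pyarrow.parquet as pq

# A batch of parsed log lines, held column by column rather than row by row.
logs = pa.table({
    "ts":          [1700000000, 1700000001, 1700000002],
    "status":      [200, 500, 404],
    "duration_ms": [12.5, 830.1, 4.2],
    "path":        ["/v1/charges", "/v1/charges", "/healthz"],
})

# Parquet stores each column contiguously and compressed, with min/max stats
# per row group, so readers can prune chunks without a separate index.
pq.write_table(logs, "logs.parquet")

# Read back only the columns the query touches, filtering on status.
errors = pq.read_table(
    "logs.parquet",
    columns=["ts", "path", "status", "duration_ms"],
    filters=[("status", ">=", 500)],
)
print(errors.to_pylist())
```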
It's also about DRY and accurately modelling the data for general operations. Imagine the difference between a data environment where a GDPR deletion request comes in and everything has a relation back to the customer identity, and one where the customer identity is denormalised out to many places or only implicitly present.
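A toy example of that difference (schemas invented for illustration): in the normalized layout a deletion request is one statement plus cascades, while in the denormalized layout you have to know every table the identity was copied into and clean each one up separately.

```python
# Hypothetical schemas to illustrate the GDPR-deletion point.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")

# Normalized: every table that touches a person points back at customers.id,
# so honouring a deletion request is one DELETE and the cascades do the rest.
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE,
        total_cents INTEGER
    );
    CREATE TABLE support_tickets (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE,
        body TEXT
    );
""")
db.execute("DELETE FROM customers WHERE email = ?", ("person@example.com",))

# Denormalized: the identity has been copied into many tables under different
# column names, so you have to track down every place it landed.
db.executescript("""
    CREATE TABLE orders_flat   (id INTEGER PRIMARY KEY, customer_email TEXT, total_cents INTEGER);
    CREATE TABLE tickets_flat  (id INTEGER PRIMARY KEY, requester_email TEXT, body TEXT);
    CREATE TABLE emails_sent   (id INTEGER PRIMARY KEY, to_address TEXT, subject TEXT);
""")
for table, column in [("orders_flat", "customer_email"),
                      ("tickets_flat", "requester_email"),
                      ("emails_sent", "to_address")]:
    db.execute(f"DELETE FROM {table} WHERE {column} = ?", ("person@example.com",))
```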