Yes, exactly. Normalization is really useful for data quality and correctness, but generally not so important for data like logs that rotates through the system on a pretty constant basis.
And addressing the parent's point on databases: they don't look like an RDBMS, but you can kind of think of log management/querying systems like Splunk et al. as a specialized database with specific properties:
- Flexible indexing: Logs change frequently, so keys come and go, and it's convenient not to have to constantly build new indexes to keep them searchable.
- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so these systems generally rotate data through different tiers of storage as it ages: the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something (see the rough sketch after this list).
- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained — it scales horizontally much more easily because it doesn't have to concern itself with things like consensus around write consistency.)
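To make the tiering point concrete, here's a minimal sketch in plain Python of routing a log event to a storage tier by age. The tier names and cutoffs are made up for illustration; this isn't how Splunk (or any particular product) actually configures retention.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tiers and age cutoffs, purely illustrative.
TIERS = [
    ("hot",  timedelta(days=7)),    # fast machines, fast disks
    ("warm", timedelta(days=90)),   # slower machines, large disks
    ("cold", None),                 # object storage, e.g. S3
]

def tier_for(event_time: datetime, now: datetime | None = None) -> str:
    """Pick a storage tier based on how old a log event is."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    for name, cutoff in TIERS:
        if cutoff is None or age <= cutoff:
            return name
    return TIERS[-1][0]

# A day-old event lands in the "hot" tier; a year-old one in "cold".
print(tier_for(datetime.now(timezone.utc) - timedelta(days=1)))    # hot
print(tier_for(datetime.now(timezone.utc) - timedelta(days=365)))  # cold
```

In a real system the rotation happens in bulk (whole index buckets get demoted as they age), but the age-based routing idea is the same.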
How many columns does the average canonical log entry at Stripe have? And what does the mix of low/high-cardinality string fields look like versus the number of metric/counter fields?