Today's uninteresting log noise is tomorrow's critical data.
I've been loving Kibana for filtering and reporting on log data in flexible and insightful ways, including automatically generated charts for certain data sources.
Yes. If you use log levels/priorities, facilities, and identities in a sane way, your logs are already classified.
Let's say there's a service failure and I want to know what the service did prior to the failure. I wouldn't want a classifier filtering the logs in that case, so that use case is out of the picture. What use cases other than filtering are there for this? Maybe as a way to give developers feedback so they can fix the log messages, as in: "this thing we log all the time never turns out to matter when trouble-shooting our services, and the classifier thinks it's noise, so we'll remove it".
It would be neat if MachineBox could sense whether log noise would be useful in other contexts--e.g., as a metric that can be graphed. Or whether your logging is lacking something that might be useful, or just lacking signal at all (hey, user, your logs are just noise!).
This is a fairly useful way of removing relatively useless information such as timestamps and line numbers when you're looking for rare or unique events. The alternative, I think, is to do a bunch of awk or sed magic, which isn't really fun for anybody. It's especially useful in a time crunch when there's an ongoing outage.
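The awk/sed-free version of that stripping can be sketched in a few lines of Python. The two regexes are assumptions about the log format (ISO-8601-ish timestamps and `file.py:123`-style line references); real logs would need their own patterns:

```python
import re

# Assumed formats: ISO-8601-ish timestamps and "file.py:123"-style line refs.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?")
LINE_REF = re.compile(r"\b\w+\.\w+:\d+\b")

def normalize(line):
    """Strip volatile fields so identical events collapse to one string."""
    line = TIMESTAMP.sub("<ts>", line)
    line = LINE_REF.sub("<loc>", line)
    return line

logs = [
    "2024-01-02 10:00:01 worker.py:42 connection reset",
    "2024-01-02 10:00:09 worker.py:42 connection reset",
    "2024-01-02 10:00:11 main.py:7 disk full",
]
# After normalization, only the two distinct events remain.
unique = sorted(set(normalize(l) for l in logs))
```

Piping through something like this during an outage gets you from thousands of lines to a handful of distinct events.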
Is it possible to make an ML algorithm that trains only on "noise" data and then identifies abnormalities? It seems like something people do easily, and it would be ideal for an application like this, where you might not have much training data covering all the "not noise" kinds of examples.
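A crude sketch of that idea in plain Python, no ML library: learn the set of message templates from a window of known-normal logs, then flag anything whose template was never seen. The number-collapsing heuristic and the class design are my own assumptions, not how any particular product does it:

```python
import re

def template(line):
    """Collapse numbers so variants of one message share a template (assumed heuristic)."""
    return re.sub(r"\d+", "<n>", line)

class NoveltyDetector:
    """Trained only on 'normal' lines; any line with an unseen template is abnormal."""
    def __init__(self):
        self.known = set()

    def fit(self, normal_lines):
        self.known.update(template(l) for l in normal_lines)

    def is_abnormal(self, line):
        return template(line) not in self.known

det = NoveltyDetector()
det.fit(["request 123 ok", "request 456 ok", "cache hit for key 9"])
# "request 789 ok" matches a known template; "disk failure on sda1" does not.
```

Real novelty/anomaly detectors are more statistical than a set lookup, but the training story is the same: you only ever show the model normal data.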
Another application would be a security camera that detects unusual events without having to train it on actual burglars.
Maybe an easier way to go is to record it structured up front (it’s already structured in the original application source anyway). That makes it much easier to record efficiently (so you can record more data) and also much easier to query efficiently; you could, e.g., invest time in machine learning on structured fields instead of having to mess around with text.
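As one way to do that with Python's stdlib `logging`, you can emit one JSON object per line instead of free text. The field names here are arbitrary choices, not a standard schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (field names are arbitrary)."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Structured fields survive as fields instead of being mashed into text.
            **getattr(record, "fields", {}),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("charge settled", extra={"fields": {"amount_cents": 1299, "currency": "USD"}})
```

Once every line is JSON, "querying" is a jq filter or an index lookup rather than a regex over prose.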
That’s what we do here anyway, it’s worked well for us:
I have limited experience, but I think that usually you would take this into account when building your loss function and heavily penalize false negatives during training.
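As a sketch of what that looks like: weighted binary cross-entropy in plain Python, where `fn_weight > 1` makes missing a real problem (a false negative) cost more than a false alarm. The weight value of 10 is purely illustrative:

```python
import math

def weighted_bce(y_true, y_pred, fn_weight=10.0, eps=1e-9):
    """Binary cross-entropy where false negatives (missed positives) are
    penalized fn_weight times harder than false positives. fn_weight=10
    is an illustrative choice, not a recommendation."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        # y == 1: under-predicting (a potential false negative) gets the heavy weight.
        total += -(fn_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confidently missing a positive now hurts far more than confidently crying wolf.
miss_positive = weighted_bce([1], [0.1])  # model says "noise" but it mattered
false_alarm = weighted_bce([0], [0.9])    # model flags harmless noise
```

Most frameworks expose the same idea directly as class weights or `pos_weight`-style parameters, so you rarely need to hand-roll the loss.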