Ask HN: How do you handle logging?
268 points by ElFitz on Aug 28, 2019 | 120 comments
Hi!

I work as the backend developer at a mobile app startup, and we don't currently have any centralized logging.

So... how do you do it? Is there any way to have something similar to AWS X-Ray, to trace a single chain of events across platforms? Unless it's a bad idea? I really don't know ^^'




1) Log to local disk (most people will tell you this is bad practice and that you should directly log to socket or whatever, but it's more likely for your network to be down than for your disk to fail).

In Python, use the RotatingFileHandler to avoid running out of space.
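A minimal sketch of that handler in Python (the path and size limits here are arbitrary):

    import logging
    from logging.handlers import RotatingFileHandler

    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)

    # Keep at most 5 backups of 10 MB each so disk usage stays bounded.
    handler = RotatingFileHandler("/var/log/app/app.log", maxBytes=10_000_000, backupCount=5)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)

    logger.info("service started")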

2) Incrementally forward your log files to a server using something like fluentd that can pre-aggregate and/or filter messages.

Big advantage of logging to disk: if the logging server is unreachable, the forwarder can resume once it's up again. If you log directly over the network and things fail, the very log messages you need to troubleshoot the failure are potentially gone.
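A bare-bones fluentd tail-and-forward config for this pattern might look roughly like this (paths, tag, and host are placeholders, not anything from this thread):

    <source>
      @type tail
      path /var/log/app/app.log
      pos_file /var/log/td-agent/app.log.pos
      tag app.backend
      <parse>
        @type json
      </parse>
    </source>

    <match app.**>
      @type forward
      <server>
        host logs.internal.example
        port 24224
      </server>
    </match>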

3) Visualize. Create alerts.

I've evaluated a bunch of logging solutions. Splunk is the best, and affordable at low data volumes (they have a pricing calculator, you can check for yourself). It's moderately hard to set up.

Sumo Logic is the easiest to set up, and at low data volumes, prices are similar to Splunk. You can get something working within an hour or less.

ELK stack is free only in bits but not in engineering time.

I've not actually tried Sentry.io but I saw it at PyCon and it looks pretty impressive. If you only care about tracking errors/events and not about general-purpose logging functionality per se, I would take a serious look at it.


“ELK stack is free only in bits but not in engineering time.” — best thing I’ve read all week. Thank you.


This is so true!


After 3-4 attempts of 7 days each over the last year, I finally had it up and running happily.


How long would it take you to do it again on another project?


Having been through this experience myself a few times, and borrowing from the GP: repeats take an additional 3-4 attempts of 7 days over 1 year.


That would still take 3 days.


Sentry is not for logging really. It's designed for errors/exceptions. I believe you should use rollbar/sentry/airbrake regardless of whether you use centralised logging. Or even before it.


+1 for sentry. You can actually self host it for free too!


Another +1 for sentry. I used it (basically accidentally) for a year because it's part of the cookiecutter-django framework[1]. I don't think there was a single issue with it the whole time we used it.

[1] https://github.com/pydanny/cookiecutter-django


+1. Sentry is amazing. We self host although they stopped shipping updates for over a year until recently. I'd pay for saas if it was up to me, tho.


> it's more likely for your network to be down than for your disk to fail

For most people, the network being down means they can't reach the disk.

Buffering unsent logs via local disk or RAM is critical because of network flakiness, for sure, but skipping logging over the network entirely is a bad idea 100% of the time.


> For most people, the network being down means they can't reach the disk.

If you're talking about cloud solutions, that's what instance stores are for. Logging to local disk and then forwarding is still the best answer. And there's still a lot of world out there that's running on actual hardware.


He specifically mentioned mobile apps. When developing mobile apps, you almost always operate in a “semi connected” state where ideally you can function without network access and rely on syncing.


He's actually working in the backend, not in the mobile app itself.


I set up Sentry a few years ago, based on a PyCon hallway-track talk I saw. We had a few false starts with it, but have integrated it with a couple of our newer platforms and have liked it.

It takes a kind of "ticket" approach to messages, it'll deduplicate and combine similar errors, and you go into a dashboard and see "We got ten thousand of this error, let me track it down, fix it, ack it and see if we keep getting it."


It blows my mind that there isn't an easily Googleable example of how to pipe JSON logs on disk to logstash. It's 2019. The software should be able to come online, check what logs it has already indexed, and then `tail -f` files matching a pattern. I really thought this was a more common use case, but it's obvious things went the "directly log to socket" route you mentioned.


filebeat is one tool that does that, watching logs and piping to logstash.


I believe you can skip logstash now and have filebeat go directly to your Elasticsearch indexer.

This simplified the ELK stack setup a whole bunch.
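For reference, a minimal filebeat.yml along those lines might look something like this (paths and host are placeholders; option names are from the 6.x/7.x era):

    filebeat.inputs:
      - type: log
        paths:
          - /var/log/app/*.json
        json.keys_under_root: true
        json.add_error_key: true

    output.elasticsearch:
      hosts: ["http://elasticsearch.internal:9200"]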


We do a similar thing. Additionally, we use logrotate to move these files into S3, which later get ETLed into Parquet so we can scoop it for long term analytics.


One should keep in mind that ELK is quite expensive once you reach the point where you need to pay for it.


How so? Honest question as I'm curious where the hidden costs might be.


In my limited experience, Elasticsearch requires a lot of RAM, storage, and CPU. The storage bloat is about 1.5x to 2x your raw data size.


With cloud options available for both Elasticsearch and Splunk these days, difficulty of setup / ease of use may be better evaluated from the client perspective. I've set up both ES and Splunk in the past few years and it's not terribly different at lower-end scale (< 100 GB logs / mo). But currently Splunk is not as good as ELK for metrics, and vice versa; with the recent SignalFx acquisition that may change in a couple of years, but it definitely isn't the case now. Also, there are tons of options for streaming logs to ES besides Logstash, including Filebeat, which is at least written in Go (Splunk's forwarders are probably in C, given I swear they've been mostly the same since the early 2000s).


Elastic cloud is not a great option for beginners, at least not at this point. For instance there is no purge of old indexes out of the box, a lot of configuration options are not available, performance issues are inscrutable.

I think it needs at least a few more months or years to be a no-brainer to choose.


https://sematext.com/cloud/ Sematext Cloud has been working well for me in a relatively small startup (a couple of thousand active users). It's a cheap solution and easy to get started with.


> network to be down than for your disk to fail

If the network is down, then how are your users going to reach your app?


Netsplits happen. User actions can be queued on a server that has just lost connectivity. Your logging server can be down or broken itself. A file on a unix system isn't going anywhere fast.


Everything logs to syslog (I generally use rsyslog) in JSON format.

All syslog instances push to a central instance, also running rsyslog. This allows us to tail logs on each instance, as well as tail / grep system-wide on the central instance.
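Not the poster's exact config, but a minimal forwarding rule on each instance can be as simple as (the target host is a placeholder):

    # /etc/rsyslog.d/50-forward.conf -- send a copy of everything to the central instance
    *.* action(type="omfwd" target="logs.internal.example" port="514" protocol="tcp")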

Central instance pushes everything directly into elasticsearch.

Using Kibana for searching and aggregating. Using simple scripts for generating alarms and reports.

Every day a snapshot of the previous day is uploaded to S3 and indexes from 14 days ago are removed. This allows us to easily restore historical data from the past, but also keeps our ES instance relatively thin for daily usage / tracking / debugging. It also makes it possible to replace our central log instance without losing too much.

All devs use some simple convention (ideally built into the logging libs) to make searching and tracing relatively easy. These include "request ids" for all logs pertaining to a single process of work, and "thread ids" for tracing multiple related "requests".

I documented how I have rsyslog and elasticsearch set up here: https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elas...


How do you change everything on a system to use JSON format? My syslog (Debian) is filled with text-line entries, and I've not seen a setting to change this.


By "Everything" in my post, I mean all of our own applications. Some services allow you to format logs as JSON, like nginx using log_format[1]. For others, you may find app-specific configuration or plugins for log formatting, or simply use plain grep / Kibana text search.
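Roughly in the spirit of [1], an nginx JSON access log format looks something like this (the fields here are chosen for illustration):

    log_format json_combined escape=json
      '{"time":"$time_iso8601","remote_addr":"$remote_addr",'
      '"request":"$request","status":"$status",'
      '"request_time":"$request_time"}';

    access_log /var/log/nginx/access.json json_combined;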

I imagine in those cases something like logstash may help, but I don't really know as I tend to avoid logstash.

1: https://stackoverflow.com/a/42564710/14651


Since others have answered with specific tech stacks, I'll give a more generalized/abstracted answer. While getting started, here are a few high-level principles I found useful to adhere to that will make your life easier later:

Think of a multi-stage pipeline for getting raw data from your transactional/interaction systems and extracting insights and intelligence out of them.

Stage-1: Ingestion – Keep this simple. Don't mess this up. It's a serious headache if you do.

1. Generate a request-id or message-id at the genesis of the request and propagate it throughout your call graph from client to servers (across any number of api call graph hops).

2. At each node in the call graph, emit whatever logs you want to emit, include this id.

3. Use whatever format you find natural and easy to use in each tech stack environment. Key is to make the logging instrumentation very natural and normal to each codebase such that the instrumentation does not get accidentally broken while adding new features.

4. Build a plumbing layer (agnostic of what is being logged) that can locally buffer these log messages, periodically compress and package them with added sequence and integrity verification mechanisms, and reliably transmit them to a central warehouse. Use this across all your server-side nodes. Build a similar one for each of your client-side platforms.

5. At the central warehouse, immediately persist these log packages durably, and only then respond to the client indicating it is safe to purge those packages on their local nodes.

Stage-2: Use-case driven ETLs.

6. Come up with use-cases to consume this data. Define the data tables (facts and dimensions) needed to support these consumption use-cases.

7. Build a high-performance stream processing system that can process the raw log packages for doing ETL (extract, transform and load) on the raw data in different formats to the defined consumable data tables.

Stage-3: Actual Use-case data applications.

Run your analytics and machine learning systems on top of these stable consumable data formats.

Keep the stages separate and decoupled in code and systems. Don't do end-to-end optimizations and break the boundaries. Recognize that the actors/stakeholders involved in each stage are different. The job of the data team is to be the guardian of these stages and run the systems and org processes to support it.


This is true if you plan to develop a SIEM or ELK completely from scratch. Interesting as general background info, but I can't see this information being practically useful to anyone who just wants to log stuff. It'd be like building a washing machine and drying machine from scratch because you want to wash your clothes.

You seem to be describing low level principles, not high level ones. A high level principle would be "forward your logs to a centralized logging service and let the logging library and the service do 100% of the work for you", which I think is what nearly everyone should do (and which most are already doing).


His first bullet point is probably the most practical thing I know about logging. Log lines are useless without context especially when the lines are interleaved. But this is easily fixed just by prefixing them with a context id that you filter on later. This requires no frameworks and even works with unstructured text logging.

Super simple, super useful, not everyone does it.


True, but this is semi-automatically handled by any structured logging library. No need to reinvent the wheel or force yourself to remember to prefix or postfix every log message with one or more "%s"'s (or equivalent) and the ID(s) to interpolate. I think that's one of the main reasons to use a structured logging library in the first place, and maybe the main purpose of structured logging.

A simple example for a Python Flask app: http://www.structlog.org/en/stable/examples.html

    log = logger.new(request_id=str(uuid.uuid4()))
    log.info("user logged in", user="test-user")
    # gives you:
    # event='user logged in' request_id='ffcdc44f-b952-4b5f-95e6-0f1f3a9ee5fd' user='test-user'
Sure, if you absolutely must use unstructured logging, you need to remember to do the format string prefixing or postfixing for every single message. But why put yourself in that position if you don't need to? Other than maybe when maintaining large legacy apps that aren't worth the effort to add structured logging to.


The point as I read it was to do this once, and only once per request. So if you have a few different microservices that call each other you generate the id once, either at the first service it hits or (preferably) at the load balancer, and then propagate it down to all other services. This is especially useful if you use queueing or methods of doing tasks not bound to the same process as what the request hits.

If I just copied your example, I would probably have a few different IDs for the same request in different parts of the application (unless it was a single-service app directly exposed to the internet).


True, this is complicated by a microservice model. I haven't worked with microservices much, but I figure there must be some libraries and tooling out there that can make this simpler. From some quick Googling, it looks like this is a component of some microservice frameworks. But I understand that this is a case where you'd often have to implement this yourself into your architecture, like at a load balancer / reverse proxy. So point 1 is valid.


Such an HN answer.

The dude's...

>the backend developer at a mobile app startup,

How about syslog, an ELK stack or something, and focus on building the app?

Some good points in there, like correlation IDs etc, all the same.


Having battled the curse of spurious errors in both large and small systems without centralized logging, I'd say these look like great tips to me. Even for smaller setups, which might need something like this more than anyone (with a simplified analytics setup in that case, and the machine learning excluded, I guess).


> and propagate it throughout your call graph from

Have you tried something like opentracing.io ?


Thanks. This provides a very useful architectural map to keep in mind whilst investigating the various implementations.


I use the stackdriver logging in Google Cloud Platform.

My GAE apps and Google services just log there automatically. My non-GCP services require a keyfile and a couple of lines of fairly trivial setup.

I have a single logging console across my entire system with nearly zero effort and expense. It works incredibly well. Doing this in-house is a waste of engineering resources.


Stackdriver is remarkably awesome for log aggregation, storage, and querying. Uptime checks to arbitrary HTTP endpoints are fantastic!

Not sure about other use cases such as visualization and triggering events. I assume they have an API or integrations for such things, just haven't needed it as of yet.

Their pricing changed recently; I don't remember the details, but I do remember that non-Google-Cloud nodes previously incurred an additional cost. Free limits are decent; I haven't paid yet for personal side stuff. But YMMV, check the pricing page https://cloud.google.com/stackdriver/pricing


Been using Stackdriver for four years and will back up those who are saying it works very well.


How has their client been with searching/tailing?


Works great? They may have other tools, but I use the web interface. It has a sophisticated search language. Logs are conveniently grouped by request. The UI could be snappier, but I really have no major complaints.


Thanks! Wow their 50GB free per month is super generous too.


This.

After using Stackdriver, setting up your whole logging mechanism in AWS, at least, feels so backward.


Does stackdriver logging become cost-prohibitive quickly?


on the free plan it caps at 100GB of total stored logs, which then rotate out. It's great if you don't care about super-old logs.


Same here. It works remarkably well out of the box.


As a startup you should be using one of the many logging services out there - definitely don't waste time rolling your own or trying to install some open source log aggregator in an EC2 instance or something.

For error tracking, which is mostly what you'll care about, use a service like honeybadger, or rollbar, or whatever fits well with your stack.

For performance metrics use a dedicated service for that as well. NewRelic, or Skylight, or whatever works well for your stack.


Yes, you want to have a single chain of events across all of your infrastructure. This is called "distributed tracing". There are a few solutions available; I recommend Jaeger.

You do need to instrument your applications to emit traces, but don't go overboard. Make sure everything can extract the trace ID from headers / metadata and that requests they generate include the trace ID. Most languages have plugins for their HTTP / gRPC server and client libraries to do this automatically.
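In Python, with the opentracing API and an already-configured tracer (e.g. from jaeger-client), the extract/inject dance is roughly this (URL and operation name are placeholders, not anything from this thread):

    import requests
    from opentracing.propagation import Format

    def handle(incoming_headers, tracer):
        # Join the caller's trace if the incoming headers carry a span context.
        parent_ctx = tracer.extract(Format.HTTP_HEADERS, dict(incoming_headers))
        with tracer.start_active_span("handle-request", child_of=parent_ctx) as scope:
            # Inject the current context so the downstream service continues the same trace.
            outbound_headers = {}
            tracer.inject(scope.span.context, Format.HTTP_HEADERS, outbound_headers)
            return requests.get("http://downstream.internal/api", headers=outbound_headers)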

You will want your edge proxy to start the trace for you; this is very easy with Envoy and ... possible ... with nginx and the opentracing plugins.

I use structured logs (zap specifically), so I wrote an adaptor that accepts a context.Context for a given request, extracts the trace ID from that (and x-b3-sampled), and logs it with every line. This means that when I'm looking at logs, I can easily cut-n-paste the ID into Jaeger to look at the entire request, or if I'm looking at a trace, type the ID into Kibana and see every log line associated with the request. (The truly motivated engineer would modify the Jaeger UI to pull logs along with spans since they're both stored in ES. Someday I will do this.)

As for log storage and searching, every existing solution is terrible and you will hate it. I used ELK. With 4 Amazon-managed m4.large nodes... it still takes forever to search our tiny amount of logs (O(GB/day)). It took me days to figure out how to make fluentd parse zap's output properly. And every time I use Kibana, I curse it as the query language does overly-broad full-text searches, completely ignoring my query and then spending a minute to return all log lines that contain the letter "a" or something. "kubectl logs xxx | grep whatever" was my go-to searching solution. Fast and free.

If anyone wants to pay me to write a sane log storage and searching system... send me an email ;)


> If anyone wants to pay me to write a sane log storage and searching system... send me an email ;)

You can pry lingo/sawzall from my cold, dead hands.


I am thinking more along the lines of interactive querying, not analysis. Show me all log lines for this request ID, or show me all log lines for the last 5 minutes from every instance of the job, etc.

Google has a system for that... but when I was there, it was awful. Meanwhile in the real world, we have ELK... and it's even worse. People stop looking at logs once the kubelet rotates it. It's just too slow and flaky.

(But yes... one key aspect to lingo/sawzall's design is that logs are sharded. And naturally, logs are sharded. Each program produces a log file over a period of time, and so (time, pod) forms a natural shard. Introduce something like ELK, and your sharding is thrown away, so you can never properly parallelize searching. A properly designed logging system would maintain shards and ensure that workers have replicas of those shards, so that you can use lots of computers to quickly get you the result you want. Of course, as much should be indexed as practical, so you can find the shard you're looking for without looking at every shard. Lots of work that could be done here, and it's all super easy. That's why it makes me mad that nobody has done this.)


Have you used sumologic or splunk? I feel like they both have the capabilities you’re talking about.


Disclaimer: I work for both Papertrail and Loggly's parent company: SolarWinds.

For general purpose logging - we deploy Papertrail's remote_syslog2 https://github.com/papertrail/remote_syslog2 - which is a more or less set-it-and-forget-it setup, e.g. specify which text files I want to aggregate, and then watch them flow into the live tail viewer.

For logging in more limited environments (can't sudo or apt-get install), we use Loggly's http API (https://www.loggly.com/docs/http-endpoint/). Also, Loggly's JSON support allows us to answer questions like: "how many signup events failed since the last deployment". Or "What is the most common signup error".

Bonus! If you're looking for trace-level reporting and integrating that with your logs, check out the AppOptics and Loggly integration: https://www.loggly.com/blog/announcing-appoptics-loggly-inte...


remote_syslog2 project doesn't seem to be very active. Still supported and maintained?


That is a great question - remote_syslog2 is a fairly mature project and is still the recommended way to aggregate your application/text log files. It does one job and does it well!

There is still active server-side development that does not show up in the rs2 repo on GitHub.

I will forward this comment along to our product team as feedback - thanks!


I love these ones.


I'm biased because my team and I created Vector [0], but I'd highly recommend investing in a vendor-agnostic data collector to start. You can use this to collect your data and send it wherever you please. This will afford you the flexibility to make changes as you learn more, which will be inevitable.

[0]: https://github.com/timberio/vector


Don't try to roll it out all in one shot. Just work on solving problems. Database is timing out? Add some logging there. Requests getting dropped between proxy and app servers? Add some logging there.

If you try to add logging across the entire infrastructure in one shot, you won't know what logs you actually need. And when it comes time to diagnose a problem, you probably won't be capturing the correct data.


This is a good point.

For me, this looked like logging to a ringbuffer and then dumping that log with an associated error report when an exception occurred. That was good enough for 99% of the errors I debugged, and we never actually needed a log-shipping solution. Logs were kept on disk and requested to be uploaded on demand when investigating specific issues.

It depends on what kind of startup you are in, what kind of product you ship, what kind of user base you have, and what kind of solution you have. If you cobble together a set of SaaS solutions, ETL will be your integration challenge.


Well I threw together a system which assigns a guid to each request and reports this guid to the user if something goes wrong. The guid is sent when calling across services internally so you can trace log lines across API calls and services.

The logs are written from containers to CloudWatch and consequently forwarded to ElasticSearch where we use Kibana and LogTrail [0] to view the logs and search them.

It's nowhere near as nice as XRay and other APM solutions but it hardly took any time to throw together. Fundamentally, this is how XRay works, only there is a specific format for the ID.

However, XRay now supports our runtime so we'll take another look at that. It looked like an interesting option at the time.

For a mobile app you'd want to assign a guid or some sort of user id to the device itself so you can track the distinct API calls it makes. I believe XRay and other systems support this but we don't have a mobile app so I don't know how that'd work for you.

[0] https://github.com/sivasamyk/logtrail


I am shocked that no one has mentioned Graylog so far.

Check it out. It's done wonders for me. You can manipulate, sort, retain, and do other things on log events with it. It uses Elasticsearch to store the logs.

It has SIEM like functionality with alerts and they are continuing to make it more suitable as a SIEM replacement.

And it does have cloudtrail support.


My only real complaint with Graylog is that it seems that between v2 and v3 all the modules/packs (or whatever they call them) broke, and the useful ones are broken now.

Maybe it's better now than when I tried, but it was a real negative to import some of them only to find out later they were incompatible.


It could be. I was never exposed to 2.X


I was just searching for the word "graylog" as I was about to say the exact same thing.


Centralized logging: SaaS services that do this are a dime a dozen. Sumologic, Datadog, Elastic, etc.

You seem to be interested in tracing or APM [1], which also has many providers.

Lots of people do a local Elasticsearch, Logstash, Kibana stack which can be done without licensing with a variety of forwarders.

You might be most interested in Envoy Proxy or Elastic APM (there are many others)

https://www.envoyproxy.io

https://www.elastic.co/products/apm

1. https://en.wikipedia.org/wiki/Application_performance_manage...


It sounds like you're looking for something like distributed tracing (vs. vanilla logging).

Zipkin (https://github.com/openzipkin) and OpenTracing (https://github.com/opentracing) purport to be vendor/platform agnostic tracing frameworks and have support with various servers/systems/etc.

X-Ray was pretty trivial to use in AWS land w/ Java as a client.


I really didn't expect to get this many passionate opinions on the matter.

It took me some time to... build up the courage to read through all of your answers, and you have been of tremendous help. I've learned quite a lot. Thank you very much! I deeply appreciate it!

I'll steer clear of self-hosted ELK, for now, mostly because, being the only backend developer, I can't really take the risk of holding the whole team back while getting it up and running or maintaining it.

I'll look into Splunk, Sumo Logic, Sentry & a few others, while keeping in mind the more general guidelines that were laid down here.

Also, thank you for the terminology! It's much easier to find the proper resources now that I know what to look for!

Edit: I'll also take some time to answer the different comments; it really felt rude of me to be procrastinating while you all had taken the time to properly answer.


Log to disk. Rotate every hour and upload to S3. Download from S3 as needed and query via grep, awk, etc.


> trace a single chain of events across platforms

Since it sounds like you also control the app, maybe make an HTTP header that the app sends that has some kind of UUID for that transaction. When your backend gets it, keep passing it on and logging it as part of your context when you emit log lines. Then using whatever log aggregation system you use, you can search for that UUID.
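A minimal Flask-flavoured sketch of that idea (the header name and app structure are illustrative, not anything the OP's backend actually uses):

    import uuid

    from flask import Flask, g, request

    app = Flask(__name__)

    @app.before_request
    def assign_request_id():
        # Reuse the caller's ID if the app sent one; otherwise start a new one.
        g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))

    @app.after_request
    def propagate_request_id(response):
        # Echo it back (and pass it on any internal calls) so it can be searched for later.
        response.headers["X-Request-ID"] = g.request_id
        return response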

As for collecting your logs, I like ELK stacks, and they are easy to set up and get all your syslogging to go there. There are also ready made helm charts to install these into a kubernetes cluster if you're using that, and they will automatically scoop up everything logged to stdout/stderr.


Apache logs go to rsyslog via logger (apparently the best option with Apache 2.4). Syslogs go to a central rsyslog server over RELP (which has mostly been reliable, but a recent rsyslog bug caused us to have to reload a week's worth of logs).

Central rsyslog server uses mmnormalize/liblognorm to parse the apache logs and load them into Elasticsearch.

haproxy logs directly to rsyslog via a domain socket, RELP to central server, lognorm to load into ES.

ELB logs go into S3, and logstash pulls them down and loads them into ES.

The remainder of syslog messages just go into files on the central server.

We also have Sentry set up with some newer applications logging into that.


The only worthy math teacher I came across once said, "The right answer to 99.9% of the questions I will ask you in the oral exam is: it depends," while smiling cunningly. To answer your question, the answer is "it depends". What you're really looking for, however, is not logging; what you're looking for is observability, which has 3 pillars:

- Logs

- Metrics

- Tracing and/or APM

The above are true for systems and applications, but let's talk applications. Your decision should be based on an assessment of at least the following:

- Do you have compliance requirements? (e.g. GDPR)

- What is your logs/metrics/traces retention period? (let's assume 30 days)

- What are your logs/metrics/traces lifecycle requirements? (Are you going to need logs older than 30 days? If not, I'd say don't bother; delete everything, since keeping them around has managerial and hosting costs.)

I advise taking a look at Elasticsearch:

- ElasticSearch for hosting logs

- For sending logs, metrics and traces you can use filebeat, metricbeat, and Elastic APM or Jaeger.

If you are a small startup, I'd say go with ElasticSearch Cloud and use their tools. They do all you need and more.

[1]: I prefer metricbeat over prometheus/grafana because it solves the high availability headache for those who already have an ES cluster, and you don't have to support (set up, monitor, manage, scale) an additional stack. You can use a push model, which has its own pros and cons.

ps. No affiliation with elastic, I just spent some time with a variety of their products and like what I see so far.


First, centralized logging is not just a good idea -- it's key when you start working with multiple servers (which will most likely be almost right away). You need to be able to trace requests / responses / errors across your platform. Many tools (including logging library -> database and a custom log search / viewer) can give you this. Just pick something that works for your budget and development process and start there. To track a single chain of events, you'll just need a GUID that you pass between calls in a single request (and use it for logging).

Next, you'll want to track analytics centrally. Etsy and Netflix have been pioneers in this area. Their engineering blogs are very good to follow. Think: something like a timeseries database (like Influx / Prometheus) and getting data into it. Use tools like Grafana to get data out of it in dashboards or reports. This is separate from your application debug / error logging system.

The next step after this is developing something that consumes data from both of those systems and provides alerts based on unusual activity -- something that provides early warning to devops.


Recommend DataDog logs[1]. It integrates with cloud providers and pulls logs from resources like load balancers, S3 buckets, etc. Additionally, you can ingest from files on servers using the DataDog agent, and finally there are language SDKs to push log events from code.

[1] https://docs.datadoghq.com/logs/


Log in-application directly to a local Fluent Bit instance (spool locally in case Fluent Bit is down, with log rotation) -> collect in a centralized Fluentd -> self-hosted, memory-optimized ES (because the default options and ES Cloud are shit) -> Grafana for monitoring & alerting.

Having spent months on this with the team, we found this to be the best high-performance stack for cloud & on-premise solutions for our clients.


how do you visualize logs with Grafana when they are stored in ES?


Grafana has an ES connector


I used external services such as Sentry[0] and NewRelic[1], which allow one to access detailed debugging and performance checks on specific errors and API endpoints.

Aside from the classic print statements and grepping log files manually.

[0]: https://sentry.io

[1]: https://newrelic.com/


The company I work for uses Splunk to ingest about 70TB/day.

Our services send to fluentd running on each instance which aggregate and flush to a Kinesis stream in AWS with KCL workers responsible for putting it through a separate pipeline that allocates the logs to specific indexes depending on the service(s) they come from as well as applying ACLs on a per-index basis.


I have a custom printf that logs to a ringbuffer in MRAM, which is sort of like battery backed SRAM that doesn't need a battery.


Structured JSON logs to Elastic Search and local disk. ES gets “info” level, disk also has “debug” but only 2 weeks.


We've had great luck with our switch from 'write everything to files' to Graylog. We've got a bunch of different microservices and having all of the logs in one searchable place has been a boon. That and our switch to Kubernetes had made our logfiles harder to get at.


There are plenty of logging stacks to choose from. I've used Elastic a lot and it has improved a lot. You can spin up a cheap cluster for under $200 and start instrumenting your servers to send stuff to it. You'll want to set up index lifecycle management to ensure you don't run out of disk space on your cluster. You scale by throwing more money at it. Basically you need to think in terms of millions of messages per day and retention periods. A $200 cluster should be able to retain tens of millions of messages.

That's how you get started. There are plenty of tools on this stack to do APM, security auditing, request logging, etc. If you are using a decent application server stack that produces metrics, it can handle those too.


I don't log anything. Deploy and forget. Sharks don't keep reliving their mistakes. Sharks swim fast, take a bite of whatever they see and move on. Are you a shark or a little fish? There's no time to answer that anymore, I've already moved on.


All OS and app log messages go to local syslog. Local syslog forwards all messages to a central facility with Graylog2 on it, where all search/visualisation/analysis happens.

This has worked flawlessly for years and is relatively easy to set up. Couldn't recommend it more.


> to trace a single chain of events across platforms?

For each individual instance of some class of things, generate a unique identifier. For example, each network request the mobile app makes to the backend should have a request ID. The mobile app includes that request ID in all its log entries and sends it with the request. The backend plumbs it through everywhere and all its log entries have it, too. If you have multiple instances of things in the backend, like batches of queries sent to a database, log an identifier for them as well.

Then you dump all of this into one big index in some semi-structured data store and use the identifiers to pull out all related entries.


Not sure if this is the best solution, but it works so far for me: ELK + Redis + Curator. Everything in a Docker container. Single-machine setup. Curator deletes old logs. Redis is responsible for caching. Logs are put directly into Redis. I think one of the most important metrics:

Performance: a 4-core machine with 32 GB RAM handles about 3,000 logs per second, at 70% CPU usage and 80% SSD usage. Quite happy with the setup, since the SSD can be upgraded to a faster one. A more powerful machine could handle about 10,000 logs per second. Would love to hear numbers from Splunk or similar solutions.

Costs: Nearly zero. Some time to set up and bring Redis + Curator into play.


We're in the middle of a cloud migration, but in our dockerized environment we're sending logs directly from stdout to cloudwatch using Docker's cloudwatch plugin.

In our legacy environment we're writing to files and sending them up to cloudwatch using awslogs.

Cloudwatch is kind of ass for logging, but they added insights somewhat recently; it upgraded cloudwatch logs from being unusable to just being a pain in the ass to use.

This works for us so far because it's super simple and we don't have a major need for log analytics, just the occasional production debugging session.

I did a PoC for fluentd + logdna/logz/etc and that also seemed to work pretty well.


For structured / trace logging like x-ray, you need to do quite a lot of work in the app. It doesn't happen automatically. You can get a bit of it "for free" from NewRelic APM which can do some sampling of execution traces, but it's mostly around function calls, not custom spans. (You can define those too)

If you need just text output logging, there are a few solutions already described. But at this point you should really make a decision - are you after simple text logs, or can you put in work to get structured events or tracing out of your app.


We use Splunk with lots of redundancy (ie. multiple forwarders, indexers and search heads per site).

I'd probably use Graylog or some ELK stack variation though if our client would let us, since Splunk is $$$.


https://github.com/grafana/loki is very promising in this space. Dead-easy to run.


Basic: log to syslog

Advanced: log structured objects (keys and string values) to Riemann. Write smart rules in Riemann, then send those to ES and explore the structured objects in Kibana.


Over activemq using a logback openwire plugin, then off to graylog using an activemq input plugin.

Works great, can handle thousands of messages per second on modest hardware.


For tracing, Zipkin is a good place to end up.

But before you get there, you can standardize on a "request ID" header that gets passed through your call stack and logged by whatever services receive it. You can search for it in your log aggregator (SumoLogic, Splunk, etc.) and get a good idea of which services your request went to, what time they got it, how long it took, etc.


1) Build a proper (local) logging service in your app. With Node.js I use winstonjs/winston.

2) Use time series databases to log your server metrics. Eg. InfluxDB

3) Familiarize yourself with CLI tools like cat, less, tail, grep, sed for when you have to get your hands dirty with raw data.

4) Logrotate is a great choice to cap the size of different program logs.
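A minimal logrotate stanza along those lines (path and retention picked purely for illustration):

    /var/log/myapp/*.log {
        daily
        rotate 14
        compress
        missingok
        notifempty
    }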


If you're asking specifically about associating events at different levels, see ELK's APM. Or other products with an APM component.

NewRelic might also help.

If you're asking about logging generally - it's a vast subject and you probably need to ask more specific questions. On a StackExchange site probably.


One of the shortcomings I've found in many backend logging mechanisms: there are no APIs to enable/disable logs at runtime.

I've made a small attempt to fix it in Node.js:

https://github.com/o1lab/dynamic-debug


Just remember to scrub the logs for privacy sensitive data.

Logs are number one for privacy violations. PIN codes, passwords, social security numbers, etc. People remember to hash the data in their databases, but logs are often forgotten about, then stored and archived containing data that is illegal under many laws such as GDPR. And obviously also a security risk.

Developers most of the time remember to scrub the data out of actual log messages, but forget that traces and rawer logged data also go into log aggregators.

I am sure I have accidentally done it as well, though I try my hardest not to.

* Twitter: https://twitter.com/TwitterSupport/status/992132808192634881

* Monzo: https://www.zdnet.com/article/monzo-admits-to-storing-paymen...

* Github: https://www.bleepingcomputer.com/news/security/github-accide...

* Facebook: https://www.wired.com/story/facebook-passwords-plaintext-cha...


Often you're not scrubbing individual data, you're logging the contents of an entire object. In this case, depending on the language, I will usually use an annotation/decoration on the object itself so the logging platform will know not to log it. I have also seen this method used in environments subject to HIPAA regulations.
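A rough Python analog of that annotation idea, purely illustrative (the class and field names are made up):

    from dataclasses import dataclass, field

    @dataclass
    class PaymentDetails:
        user_id: str
        # repr=False keeps the PIN out of anything that logs the object via repr()/str().
        card_pin: str = field(repr=False)

    details = PaymentDetails(user_id="u-123", card_pin="0000")
    print(details)  # PaymentDetails(user_id='u-123')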


Instead of comments+data+logging as three sources of truth of what the system does/did, only do logging.

If your system observed something, it writes it down (logs it). If you want your system to react to that thing happening, then the log is going have to be machine-parsable.


When tracing across microservices, the approach we took was to embed a request id in the incoming request headers, then use it in every log line and propagate it on to other microservices as they are called.

Helps us see what happened across the whole system.


I run my node.js apps with PM2 and it supports logging out of the box. It probably won't scale very well, but for a single server app to run side projects it's perfect.


Hey! I work for Coralogix, maybe you'd like to check us out :) https://coralogix.com/


AWS CloudWatch. I log everything as JSON and then use CloudWatch Insights to query it quickly. It is the cheapest solution I have found and is pretty easy as well.
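For example, once the JSON fields are discovered, an Insights query can be as simple as this (the `level` field is an assumption about your own log shape):

    fields @timestamp, @message
    | filter level = "error"
    | sort @timestamp desc
    | limit 50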


Scalyr.com and Humio have some of the strongest logging platforms in the market. Their approach of not creating indexes to search makes them easy to work with.


Organizations are asking for modern approaches to log management issues. The availability of an index-free approach will enable them to get faster results at a fraction of the cost compared to traditional approaches. Index-free logging provides an entirely different approach that doesn't involve indexing at all. This new wave of index-free log management incorporates three key approaches that are increasingly attractive to users looking for a modern approach:

* Reduce the amount of data you have to manage;

* Reduce the amount of data you analyze;

* Trade off a slight decrease in analytics on historical data for much faster ingest, larger flexibility and better efficiency on hardware usage.

See more here: https://www.itopstimes.com/monitor/humio-index-free-logging-...


You can try our SaaS solution for free! https://datawiza.com It gives you full observability into all your APIs. Compared to existing solutions, we save you from tedious and heavy work, such as configuring logs, parsing, extracting, and enriching data in your logs, building dashboards, etc. All you need to do is install our software in 2 minutes, and then you get comprehensive dashboards for all your API activities.


We just recently started using CHAOSSEARCH (chaossearch.io) in combination with Fluentd. Highly recommended!


On old projects, NLog to a local file log and ELMAH pointing to a centralised database.

On newer projects, Serilog with a combo of text file logging, a SQL sink, and just recently centralised Elasticsearch.


Give Mozdef a look if you want alerting on your logs.


System.out.println()


About 7 years ago, when I was trying to monitor .Net apps, there weren't that many alternatives available. The Elasticsearch-Logstash-Kibana stack had just gotten off the ground and I needed a way to send it structured logs from a large number of machines. Logary.tech was born to solve that problem.

Since then Logary has expanded with excellent support for sending both metrics and tracing data to a large number of targets. In production, I use this setup;

client browser -> Logs & metrics to Logary Rutta HTTP ingestion endpoint via Logary JS

nginx-ingress -> Traces to Jaeger Agent via opentracing C++ client
nginx-ingress -> Metrics to InfluxDB
nginx-ingress <- Metrics via Prometheus scrape annotation

Our NextJS site and GraphQL server:
site -> Traces to Jaeger Agent via opentracing
site -> stdout logs via Logary JS (also get added as Logs in the Span of OpenTracing)
site <- Metrics via Prometheus scrape annotation and prom-client

api -> Traces to Jaeger Agent via Logary's F# API
api -> Metrics to InfluxDB via Logary's F# API
api <- Metrics via Prometheus scrape annotation and Logary.Prometheus
api -> Events to Mixpanel via Logary.Targets.Mixpanel
api -> Logs to Stackdriver via Logary.Targets.Stackdriver (hosted on GCP)

Also, Kubernetes ships logs via FluentD to Stackdriver in GCP, but they are not structured, and the remaining infrastructural services also send traces to Jaeger if they can.

Logary Rutta is a stand-alone log router, written in Hopac + F# (like Concurrent ML), and used by some of the largest Swedish software companies for thousands of logs and metrics per second. It's capable of shipping to a large number of targets https://github.com/logary/logary/tree/master/src/targets Since it talks HTTP and UDP with a number of encodings (JSON, plain, binary), it's easy to plug into an existing infrastructure and existing log shippers. It can also connect point-to-point to itself with a high-perf binary encoding. Because you can send any JSON into it, it's very easy to get started with together with mobile apps.

Logary for JS currently has support for user logs, and I'm currently testing rudimentary metrics and browser info.

Logary for .Net supports the OpenTelemetry spec, structured logging and metrics.

Of course you can pick any toolchain you want, but I've had great success (and great fun!) writing and using the above. You can see I don't keep logs on disk; it causes them to fill up; if your network is down, your service is down, and then you know it's the network anyway.

Once in Logary, you can choose where you send them. I've done an analytics/ETL pipeline based on Logary with its Stackdriver+BigQuery+GooglePubSub targets and with Flink, with great success as well. Logary is free to use for non-profit and then I have a pricing calculator on the home page, for when you start selling the software you build. Pricing aside, how Logary is structured and how I've used it might give you some hints on how to do it yourself.


Never centralize logging. Log at the leaves and store it there. Push search predicates down to servers running on each leaf when/if needed. Log to sockets always; your FD can be a regular file if you want but keep the flexibility to change it later.


This doesn't work well unless all your servers are custom pets or real hardware. If you have ephemeral instances scaling up/down, you'd lose history this way. Also, you'd affect performance of the service, likely when you need it most - when the app is having issues and you're trying to debug it. There may also be limitations around data retention on a single machine.

If you're doing distributed containers, lambdas, or other more ephemeral things, you just can't do logging at leaf unfortunately.


Leaf logging is also susceptible to attackers deleting log files. Central logging is effectively append only from the leaf and thus provides a security benefit.


So the attacker starts DoSing your central log server and you do what?


how is my logging server being attacked from outside my network?



