ankitnayan's comments | Hacker News

Nothing wrong there. If enough users want it, we can add ClickHouse as well.


Correct. Ideally, the monitoring stack should be outside the blast radius of other applications. Would handling another Kafka cluster (probably smaller than the business one) be a pain for a team that already manages a business Kafka cluster? What do you think?


Grafana has long been used to monitor time-series data and has recently been moving towards observability (including traces and logs). We differ on quite a few fronts.

1. There are observability-specific UI widgets like service maps, SLOs, and error budgets; I don't know whether Grafana provides them now. Also, the last time I used Grafana, linking and moving from one dashboard to another was still a pain. You can get a better idea of how different an observability UI can be from Grafana by looking at the LightStep demo.

2. We can run aggregates on filtered traces. E.g., I can get the 99th percentile response time for a tag, say payment_channel. I am afraid this cannot be extracted from traces by Grafana.

3. SigNoz is easily extensible: you can add your own stream processing application to slice and dice data in your own way.


I completely agree with you. For companies not already using Kafka, this asks for a big commitment to self-host Kafka.

You mentioned a great approach: the queueing system as a plugin. Thanks!


Great point. To start off, we shall provide different hardware configs (micro, small, medium, large, xlarge) along with the scale each can handle.

We soon plan to emit metrics from the different components of SigNoz and set up autoscaling for them. Druid has already put some thought into autoscaling. Check out https://druid.apache.org/docs/latest/configuration/index.htm... and https://www.adaltas.com/en/2019/07/16/auto-scaling-druid-wit...


I was pretty surprised to see the results too. A single-node Kafka with a 2GB Xmx value was ingesting 4,500 events/sec (around 1 MB/s) on a single partition.

I blogged about my experiments with SigNoz's scale at https://signoz.io/blog/signoz-benchmarks/. I'm hoping to get better at fine-tuning configs and writing them up.


I think the concerns raised in this thread are less about raw throughput and more about (1) the complexity of the typical production Kafka deployment, (2) the arguably unnecessary, highly complex ecosystem around Kafka that you have to pay people or companies to use effectively, and (3) the history of data-loss problems with ZK/Kafka caused by leader-election bugs.


Exactly right. In my personal experience, Kafka's reputation for data loss and other mishaps is well-earned. Some of these issues are well documented by the Jepsen tests.


Hmm, I get your point. I searched for Kafka alternatives for a bit before including it in our stack, but I couldn't find anything as widely adopted. It would be good to know a few Kafka alternatives you prefer that can handle equivalent production scale.


I agree with my sibling comment, and reiterate my cousin comment that you've replied to (commenting here to complete this sub-tree).

Queuing technologies will come and go; IMO it's better to focus on the interface and allow people to swap in whatever implementation they prefer and are accustomed to. It also benefits you in the long term, because an application that is less coupled to a particular external dependency is easier to test.

Some examples of queuing tech that's deployed successfully at scale: Redis Streams, RabbitMQ, Amazon's SQS. Since this is written in Go, you could even offer an in-memory, channel-oriented stream implementation, with no external dependencies.

Not one of these is universally better than Kafka: each offers a set of trade-offs, but a very similar interface from SigNoz's point of view.

For SigNoz's hosted/tenant-based solution, it might absolutely make more sense to use Kafka. But self-hosted users bring different trade-offs to the table, and might prefer to use another solution.

Strategically, you can write/maintain the plugin for Kafka (very similar to how you operate right now, except it leaves the door open to more plugins in the future) and encourage community contributions for other tech. Or, when you're big enough, you might want to employ people to maintain those plugins too, since they're good for adoption.


I really liked the clarity with which you put things. Thanks for these inputs and suggestions; we will definitely think harder about this.


Have an interface for a queuing system and support other things, not just Kafka. Ideally, you want a default/dev instance to ship with something super simple, zero-setup, and maybe in-memory, while allowing you to swap in Kafka or something more capable as needed.


That's an interesting point. Curious: would you use a project that supports a simple in-memory datastore, but nothing that would be useful in a production environment? Do you think being easy to get running in a dev environment is valuable for adoption, even if that setup won't work in prod?

I am trying to understand what would be a good way to prioritise.


> Do you think being easy to get running in a dev environment is valuable for adoption

Yes, it'll make a big difference to adoption. If step one of your setup instructions is "provision a Kafka cluster", then you are going to lose 90% of people right there.

Ideally, your dev install is super simple and has a built-in in-memory queue. The key here is to make it as simple as possible to get started. Once people have tried it out and become invested, then you can say "for production scale, use Kafka instead of FastQ/SimpleQue/Whatever".

The key to that second step is to have your product abstract the queue functionality it needs into an interface it uses to talk to the queue, allowing people to swap out queue backends with a simple configuration change.

So, make it simple to get started - and simple to scale up when you decide to.


Hmm, got your point. We shall definitely look into integrating other queuing systems behind an interface. Trying to understand better: what does a super-simple dev setup look like (to drive adoption)?

Right now, we can run SigNoz with all components, including Kafka and Druid, in 4 GB of memory, supporting around 200 events/sec. Though we will need to check whether this micro setup survives a run of a few days.


What you have now isn't too bad, but figure out if you can get it down to a single command. Have a look at what netdata does (https://learn.netdata.cloud/docs/agent/packaging/installer#a...): it's a single command to install, works really well, and is super quick to get started with on a single node.


Redis is like this. It runs in-memory by default, but can be trivially configured to write to disk / be persistent between server sessions. Many credit this feature as a catalyst to its adoption.


AWS's Kinesis, GCP's PubSub, whatever Azure has.


Azure has Azure Service Bus (a "full" messaging system, with subscriptions, topics, routing, AMQP 1.0 etc), and also Azure Storage Queues (a very simple queueing system where a client polls a single queue for messages).


Thanks, but wouldn't restricting ourselves to a specific cloud service reduce adoption?


I was just outlining what the messaging options on Azure were, for the parent poster.


Nice thoughts. A few other users also pointed this out.

We observed that enterprises and other observability SaaS vendors have scripts and controllers to keep these components running. We plan to open-source those too. As you rightly pointed out, running OSS takes man-hours, and we will try to remove those frictions.

Also, when working with Prometheus and Jaeger, we observed that people end up using Kafka to handle scale anyway, and most OSS tools are good at the start but become pretty complicated at scale. E.g., Prometheus's long-term storage solution is Cortex, which is itself difficult to manage. In that case, Kafka should be an easier beast to handle than the many moving components inside Cortex. We built SigNoz as a scalable alternative inspired by stream processing architectures.

We will also be providing sampling strategies, including tail-based sampling, to retain important data without unnecessarily clogging disks.


Hey, I am one of the maintainers of SigNoz. ELK is tightly coupled to Elasticsearch, which may not be the ideal database for handling OpenTelemetry data. We wanted to be more of a platform that can provide different DBs as plugins. Users can also build their own use cases by writing more stream processing applications.

On the other hand, Druid powers analytical queries on data and is efficient in handling high-dimensional data. Many companies use Druid at scale (https://druid.apache.org/druid-powered).

Also, Jaeger, a distributed tracing tool, provides plugins for Cassandra, Elasticsearch, Badger, etc. Some users found limitations in running fast aggregations over filtered traces. With Druid we can now search by annotations (without needing the service name) and get aggregates on filtered traces, like the p99 for a version=xyz filter.


Hosting your own solution on Heroku might be costly and not worth maintaining. Go for hosted products like Graphite or collectd + Grafana if all you need is to send your own custom structured metrics. If you need APM capabilities, check out DataDog.


If I have 8 services sending a combined total of 100 GB of traces per day, and 1 extra service that alone sends 100 GB of traces per day, how is your pricing justified, either for LightStep or for the customer?

