Hacker News
Ask HN: Best time series database currently
21 points by UndefinedRef on Nov 30, 2017 | 16 comments
I am looking for a database to store time series data from sensors and found CrateDB. It looks very interesting. Does anyone have experience with it?

I need something that will scale well horizontally, which it seems CrateDB should be able to do, and CrateDB also has an MQTT broker built into its enterprise version.

Or maybe some other alternative that will scale well?

I am looking at InfluxDB and Prometheus also, but the MQTT broker in CrateDB really appeals to me.
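For a sense of what a built-in broker saves, the external glue is roughly this much Python, here using the paho-mqtt client (the topic scheme, payload format, and broker address are all assumptions for illustration):

```python
import json
import time

def parse_reading(topic: str, payload: bytes) -> dict:
    """Turn an MQTT message into a row ready for a time-series insert.

    Assumes topics like 'sensors/<device_id>/<metric>' and JSON payloads
    like {"value": 21.5, "ts": 1512000000}; adjust for your devices.
    """
    _, device_id, metric = topic.split("/", 2)
    body = json.loads(payload)
    return {
        "device_id": device_id,
        "metric": metric,
        "ts": body.get("ts", int(time.time())),
        "value": float(body["value"]),
    }

if __name__ == "__main__":
    # Hypothetical wiring with paho-mqtt (pip install paho-mqtt);
    # a broker built into the database replaces this glue entirely.
    import paho.mqtt.client as mqtt

    client = mqtt.Client()
    client.on_message = lambda c, u, msg: print(parse_reading(msg.topic, msg.payload))
    client.connect("localhost", 1883)
    client.subscribe("sensors/#")
    client.loop_forever()
```

The parsing is the part worth keeping separate either way, since it is what you would batch into inserts.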



Just have a look at Timescale. It's an open-source PostgreSQL extension, and can sit side by side with other stuff. Scaling PostgreSQL should be no problem for you.

https://www.timescale.com/


I don't think PostgreSQL scales nicely horizontally. I do use TimescaleDB, and it scales really nicely vertically, but it would probably be hard to make it run on multiple machines.


I see Timescale more as an alternative option, since the OP also asked for those. And you're right: since Timescale is a PostgreSQL extension, it has the same benefits/problems as PostgreSQL itself in this regard.


Thanks for the recommendation of Timescale :)

Here are some more details on our future plans for clustering. We do have horizontal scale-out clustering on our roadmap and it's hard to say exactly when it will be released, but we are aiming for the 2nd half of 2018.

That said, we do often find that there are multiple reasons why people ask about "clustering" or say they need scale-out:

A. Because you want to scale the amount of available storage - (we let you elastically add disks to scale up the capacity of a single hypertable; we have had customers scale a single hypertable to 500B rows)

B. Because you want high availability - (we support this today, via PostgreSQL streaming replication and will be documenting this further)

C. Because you want to support more concurrent queries - (supported today across primary replicas)

D. Because you want to support high ingest rates - (depending on your use case, we have users doing 100-400k rows / second)

E. Because you want to parallelize individual queries (that touch a lot of data) - (some support for parallelization today, more to come)

So we do meet the needs of many today without support for full scale-out clustering (scaling vertically, as jurgenwerk points out). If your requirements are closer to millions of rows per second of inserts and storing 100s of TBs / PBs of data, we can't yet support this, but we are working towards it!
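For a sense of what (A) buys you: hypertables split data into time chunks under the hood, which is what makes elastically adding disks workable. A toy Python illustration of the chunking idea (the interval and row shape are made up; this is not TimescaleDB's API):

```python
from collections import defaultdict

CHUNK_SECONDS = 7 * 24 * 3600   # pretend one chunk per week of data

def chunk_id(ts: int) -> int:
    """Map a unix timestamp to its time partition (chunk)."""
    return ts // CHUNK_SECONDS

def partition(rows):
    """Group (ts, value) rows by chunk; each chunk could live on its own disk."""
    chunks = defaultdict(list)
    for ts, value in rows:
        chunks[chunk_id(ts)].append((ts, value))
    return chunks

rows = [(0, 1.0), (3600, 2.0), (8 * 24 * 3600, 3.0)]
print({cid: len(vals) for cid, vals in partition(rows).items()})
```

Because inserts for sensor data almost always land in the newest chunk, older chunks stay cold, which is also why vertical scaling goes further than people expect.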


I haven't gotten to really use it that much, but I'm a fan of this tool!


Cool thanks. I will check it out.


OpenTSDB scales well: 10 million datapoints per second of writes on a 36-node cluster, for example. It's not easy to get working well at scale out of the box, though; that is something we really need to work on in that community. There is a project called Splicer (github.com/turn/splicer) that shards incoming queries into 1-hour blocks and caches those blocks; it also sends each query to the node whose region server hosts the data. This makes queries VERY fast.
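The one-hour sharding idea behind Splicer can be sketched in a few lines of Python (an illustration of the approach, not Splicer's actual code):

```python
HOUR = 3600

def hour_blocks(start: int, end: int):
    """Split the query range [start, end) into hour-aligned blocks.

    Alignment is what makes caching pay off: the same wall-clock hour
    always maps to the same block, no matter which query asked for it.
    """
    blocks = []
    t = start - (start % HOUR)          # align down to the hour boundary
    while t < end:
        blocks.append((max(t, start), min(t + HOUR, end)))
        t += HOUR
    return blocks

# A 2.5-hour query starting at minute 30 becomes three blocks.
print(hour_blocks(1800, 10800))
```

Each block can then be fetched in parallel, served from cache, or routed to the node holding that hour of data.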


How about kdb+, the in-memory time-series database? It has a long history in financial services.


We use DynamoDB at my place. We had to make some tooling around it and figure out some growing pains, but now we have an extremely scalable solution.


I don't know anything about this, but it might help if you provide information on what you want to do with the data.


My main concern for now is being able to store/write the data collected from a million IoT devices in a scalable fashion. There is not really a requirement for what to do with it yet; once the data is there it will be analyzed on an ad-hoc basis, and then we will see what we can do with it.


That is going to be hard to achieve.

To know how to store the data, you need to know how you will query it.

If you know how you will query it, you can devise a way to store it in, say, Cassandra, which will scale up to PBs.
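To make "know your queries first" concrete: a common Cassandra layout for sensor data partitions by (device, day), so a query for one device's readings on one day touches exactly one partition. A small Python sketch of the key derivation (field names and the one-day bucket size are illustrative assumptions):

```python
from datetime import datetime, timezone

def partition_key(device_id: str, ts: int) -> tuple:
    """Cassandra-style partition key: (device_id, day).

    With the equivalent CQL primary key PRIMARY KEY ((device_id, day), ts),
    a query like "device X's readings for day D" hits a single partition,
    which is what keeps reads fast at scale.
    """
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (device_id, day)

print(partition_key("dev42", 0))
```

Pick a different key and the same data becomes slow or impossible to query that way, which is the point of the comment above.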

If you just throw in a PB of data, I am not aware of ANY system that will let you run ad-hoc queries against it and get fast answers. You effectively need to load much or most of the data off disk to process it.

If you only want to store it and not query it yet, store it in flat files. When you decide how you will query it, load it from the flat files and switch writes over to your new system.
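That flat-file route can be as simple as append-only, day-partitioned newline-delimited JSON. A minimal Python sketch (paths and record shape are assumptions):

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def append_reading(root: str, device_id: str, ts: int, value: float) -> str:
    """Append one reading as NDJSON under <root>/<YYYY-MM-DD>.ndjson.

    Day-partitioned files keep the later reload simple: replay each file
    in order into whatever store you pick, then switch writes over.
    """
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, f"{day}.ndjson")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"device_id": device_id, "ts": ts, "value": value}) + "\n")
    return path

root = tempfile.mkdtemp()
p = append_reading(root, "dev42", 0, 21.5)
print(open(p).read().strip())
```

Appends are cheap and sequential, so this keeps up with high ingest rates while you defer the schema decision.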


If you were using OpenTSDB: 1 million devices, reporting once per minute. We'll round that up to 17k submissions per second. Let's say each submission is, um, 50 datapoints. That's 850,000 datapoints per second. Not really that hard to store. With 12 tags, OpenTSDB stores that in about 100 bytes per datapoint, so you'd be writing about 85MB/s of data across your cluster. Not a big deal there. Let's say you have 12 nodes with 8 disks at 2TB per disk. 192TB with 3x HDFS replication, so 64TB of usable space. At that write rate, 64TB works out to roughly 750,000 seconds of capacity (call it 9 days), so long retention means more nodes, compression, or downsampling. Queries will be fast also, but you need to set it up well, and it's easy to do it wrong (sorry about that, we're making it better). I recommend Splicer with 4 or 5 query instances per data node to parallelize the queries and take advantage of locality (hit the query instances on the same node as the HBase region).

I missed where the Petabyte of data came in, but I just made up some numbers here that are accurate from my experience.
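Redoing that arithmetic in Python as a sanity check (decimal TB/MB and the stated 100 bytes per datapoint are assumed; the usable space supports days, not months, of raw retention at this rate):

```python
# Back-of-the-envelope capacity check for the OpenTSDB sizing above.
devices = 1_000_000            # each reporting once per minute
datapoints_per_report = 50
bytes_per_dp = 100             # ~100 bytes/datapoint with 12 tags

dp_per_sec = devices * datapoints_per_report / 60
write_mb_per_sec = dp_per_sec * bytes_per_dp / 1e6

nodes, disks_per_node, tb_per_disk, replication = 12, 8, 2, 3
usable_bytes = nodes * disks_per_node * tb_per_disk * 1e12 / replication  # 64 TB

capacity_seconds = usable_bytes / (dp_per_sec * bytes_per_dp)
capacity_days = capacity_seconds / 86_400

print(f"{dp_per_sec:,.0f} datapoints/s, {write_mb_per_sec:.1f} MB/s")
print(f"capacity: {capacity_seconds:,.0f} s (~{capacity_days:.1f} days)")
```

Swapping in compression, downsampling, or a shorter raw-data window changes the retention figure proportionally.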


I recently built a system capable of handling this load, but it was based on MongoDB, and that's not an option in this project because of hardware constraints.

Cassandra may be an option. I will look into it.


Maybe this is a better option for you.

http://www.scylladb.com/


It may be worth looking into what AWS has to offer. Their IoT offerings seem to be pretty solid.



