Hacker News | jodok's comments

Thanks for the comment, we'll update the indexes used in Postgres. We tried to be as objective as possible and of course show our sweet spot. We don't want to treat Postgres unfairly; if you have suggestions on how to improve the indexing in Postgres, we'd be happy to take them into account. (Full disclosure: I'm a co-founder of Crate.io.)


@ralala - would love to hear your feedback once you've tried Crate (jodok@crate.io). We've been working directly with Kyle on the issues he discovered. The current state of our work is documented on our website: https://crate.io/docs/scale/resilience/

Thankfully, for most use cases, if you follow our best practices, you are extremely unlikely to experience resiliency issues with CrateDB.


Why not run the database directly in a container as well? We're doing that with https://crate.io - just expose your local instance store as a volume and Crate will take care of the rest.


Co-founder of https://crate.io here. We're participating in this game by working hard to build a fully distributed, shared-nothing SQL database.

In our vision, an app in a container is able to access the persistence layer like in SQLite - just import it. Another cluster of containers - preferably with one instance node-local - takes care of the database needs.

The database is distributed and makes sure enough replicas exist on different nodes. It's easy to scale up and down, and local resources are utilized whenever possible (no NAS/SAN-like storage).


cool stuff


Full disclosure: I'm one of the co-founders of Crate. :)

In December 2010 we found out about Elasticsearch and were truly amazed by its simplicity, speed and elegance. We built our service and consultancy business around it.

In 2011 we built some of the largest ES applications at that time (6 TB, 120-node cluster, http://2012.berlinbuzzwords.de/sessions/you-know-search-quer...) and started to develop a set of plugins, such as the in-out plugin to allow distributed dump/restore.

With this background - and the mission to build a datastore that is as easy to use and administer as Elasticsearch - we founded Crate in 2013 and raised some seed money. Since then we've been working hard to make this vision come true. We're often confronted with the results of the so-called Jepsen test (https://aphyr.com/posts/317-call-me-maybe-elasticsearch) that Aphyr published in 2014. Don't forget: Lucene, Netty, Elasticsearch, Crate - all are open source products (APL) and rely on all kinds of contributions, such as this analysis, no matter whether it brings bad news or good news. We can only improve based on hard testing and feedback. The analysis caused a lot of rumblings in the Elasticsearch ecosystem, and Elasticsearch's reaction was exemplary:

1) explain the reasoning and make an official statement: https://www.elastic.co/blog/resiliency-elasticsearch/

2) list all the issues and hunt them down, one by one: http://www.elastic.co/guide/en/elasticsearch/resiliency/curr... (and add new ones as they occur).

3) stay in contact with the community that reported the issues: https://twitter.com/aphyr/status/525712547911974913

All that being said, we see many people using Crate as their primary store (and of course backing up their data), but we also see people who don't put that much trust in a younger database yet and keep all their primary data in another location, syncing/indexing into Crate.

ALWAYS make backups (COPY TO / COPY FROM), make sure you have replicas, and most importantly configure minimum_master_nodes correctly to avoid split-brain.
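As a rough sketch, those three safeguards look something like this in SQL (table name, paths and the five-node cluster are hypothetical; please check our docs for the exact syntax of your version):

```sql
-- Backup: export the table to a directory on each node.
COPY mytable TO DIRECTORY '/data/backup/mytable';

-- Replicas: keep at least 1 replica of every shard (up to 2 if nodes allow).
ALTER TABLE mytable SET (number_of_replicas = '1-2');

-- Split brain: quorum for 5 master-eligible nodes is (5 / 2) + 1 = 3.
SET GLOBAL PERSISTENT "discovery.zen.minimum_master_nodes" = 3;
```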

At Crate we stand on the shoulders of these great open source projects, try to be good citizens, and focus mainly on our query engine (analyzer, planner, execution engine).


It's named Blender Pro and was designed by our Swiss friends at binnenland. We're super happy that they granted us a generous license to use it within Crate. More about the font: http://www.binnenland.ch/notes/view/about-the-blender-typefa...


We run aggregations fully distributed, and when iterating over the values we rely heavily on the field caches. They hold the values of the most recently used fields in memory and therefore allow in-memory performance on them. For example, they don't grow linearly with the number of rows stored, but depend on the cardinality of the fields. Running aggregations over non-indexed data is not supported.
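To illustrate (table and column names are hypothetical): a query like the one below groups on a single field, so the field-cache memory it needs scales with the number of distinct values of that field, not with the total row count:

```sql
SELECT country, count(*) AS requests, avg(latency_ms) AS avg_latency
FROM access_log
GROUP BY country;
```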


Not yet, but we have some "tweaks" for exactly your use case on our backlog. Using the client libraries should make it much easier (e.g. https://crate.io/docs/projects/crate-python/stable/sqlalchem...). So right now you would need to do a refresh. On a side note: it's not an index delay; it's the readers that "sit" on the Lucene index. They are reused for performance reasons (and meanwhile other writes are being appended). Like the client libraries, you can force them to be reopened (https://crate.io/docs/en/0.47.8/sql/reference/refresh.html) - of course at the cost of performance.
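Forcing the readers to reopen manually is a one-liner (table name hypothetical; see the refresh docs linked above):

```sql
-- Makes all writes issued so far visible to subsequent reads.
REFRESH TABLE mytable;
```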


We recommend exposing a host directory to Crate ('docker run -d -p 4200:4200 -p 4300:4300 -v <data-dir>:/data crate') and configuring replicas. If one of the Crate containers disappears, replicas will be promoted to primary shards and new replicas created on the fly. It's also possible to expose multiple directories (e.g. on multiple disks for more performance); you can configure Crate to use them in parallel.


I guess this is using ES multiple datapath support?

Are you also planning to move to a single shard per datapath like ES? If that is the case, what are your thoughts on increasing the shard count once it's single shard per datapath?


I guess you are talking about the plan to have the data of one shard only on one disk (see https://github.com/elastic/elasticsearch/issues/9498)? This does not necessarily mean that you will end up having only one shard per datapath - only if you have just one shard per node. But you are right, the change might lead to unbalanced disk usage in some scenarios, where increasing the number of shards would solve the problem.

There are two options:

1. (Recommended for now) Export the table with COPY TO ( https://crate.io/docs/stable/sql/reference/copy_to.html). Drop the table and then import it again using COPY FROM (https://crate.io/docs/stable/sql/reference/copy_from.html ).

2. Use insert by query (see https://crate.io/docs/stable/sql/dml.html#inserting-data-by-... ) if it is ok for you to copy the whole data to another table (with more shards).

1) is recommended, since it allows for throttling at import time (see https://crate.io/docs/en/latest/best_practice/data_import.ht...) and also does not require renaming a table, which is currently not implemented but is on our backlog. However, I think once ES 2.0 is out we will have table renames and also throttling in insert-by-query, so option 2) will be recommended then.
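Option 1 as a sketch (table definition, shard count and paths are all hypothetical; check the COPY TO / COPY FROM docs linked above for the exact syntax):

```sql
-- 1. Export the data to a directory on each node.
COPY mytable TO DIRECTORY '/tmp/mytable_export';

-- 2. Drop and recreate the table with more shards.
DROP TABLE mytable;
CREATE TABLE mytable (
    id integer PRIMARY KEY,
    body string
) CLUSTERED INTO 12 SHARDS;

-- 3. Import the exported files again.
COPY mytable FROM '/tmp/mytable_export/mytable_*';
```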

Our general recommendation regarding the fixed-number-of-shards limitation is to choose a higher number of shards upfront (the number of expected cores matches most use cases) or to use partitioned tables (https://crate.io/docs/en/latest/sql/partitioned_tables.html) where possible, since those allow changing the number of shards for future partitions.
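Sketched out (the schema is hypothetical): each new partition is created with the table's current shard setting, which can be raised later without touching existing partitions:

```sql
CREATE TABLE metrics (
    day timestamp,
    value double
) CLUSTERED INTO 6 SHARDS
PARTITIONED BY (day);

-- Future partitions will be created with 12 shards;
-- existing partitions keep their 6.
ALTER TABLE metrics SET (number_of_shards = 12);
```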

