Hey everyone, we'll be hosting a live webcast and Q&A with our engineers this afternoon to introduce the new changefeeds and cluster management API in RethinkDB 1.16.
It was very easy to set up, the web interface looks wonderful, it has joins, and now realtime push and cluster management? I'm giving this a try! Why don't all databases come with such beautiful web interfaces?
RethinkDB is awesome, but to my understanding it's one of the slowest document stores out there right now. That's one reason why it hasn't seen very widespread adoption yet. For smaller projects it's not a problem, but for large projects databases can often be a major bottleneck.
RethinkDB performance has improved dramatically in the last year. We'll be posting benchmarks for the 2.0 release, which should dispel any performance concerns.
That's great. I'd love to see you guys completely replace the MongoDBs of the world. Once you eventually reach MongoDB's performance across the important areas, switching would be a no-brainer for me, and probably for many others.
RethinkDB CEO here. Don't worry. During the early stages of ReQL design I suggested a string representation for commands (starting with a dollar sign), and the engineering team kicked me out of the room because that would have opened RethinkDB up to injection attacks. If I suggest something like write durability off by default (or something of the sort), I'll have a full mutiny on my hands.
We're systems people. You won't get another Mongo.
MongoDB has been pretty decent (the defaults early on were probably not the best choices)...
The biggest difference in RethinkDB's approach has been data sanity first, with niceties a close second (nice devops and dev interfaces)... It's a very well-thought-out database, and as long as people are willing to move a little farther away from SQL thinking and interfaces, it makes more sense.
I'm not sure where automatic failover stands wrt RethinkDB, but barring that, I would probably reach for RethinkDB, MongoDB, or Elasticsearch for most small-to-medium db tasks, depending on the application's needs. By medium I mean a cluster of fewer than 10-15 servers. Anything bigger, and Cassandra would probably be my first choice.
That's exactly why we added the administration API to our query language in this release. Previously we had some command-line tools that were a real pain. Now you can start development using the web UI, and then shift to scripting massive deployments once you get there.
Wow, the realtime push feature will be a game changer. It removes so much of the boilerplate required to build server push apps. I'm super excited to start using this.
I've been watching RethinkDB and it looks really cool, but...
One need I have in nearly every project is record versioning. Think of implementing a modern wiki, a CRM (what did X alter? Which phone numbers have been assigned to this person?), versioning updates to data records, etc. In a realtime web app with multiple users editing records simultaneously, it's even more important.
Versioning is a pain to implement in every system I've used, and I'm looking for something to lower that pain.
CouchDB et al. have built-in versioning, but cause pain in other places. I've looked at the RethinkDB docs and found no mention of it - so how would RethinkDB users handle versioning, and are there any helpers on the way?
The app I'm building uses incremented version counters as a way of protecting against simultaneous edits. I'm discarding the old version of each document, but it would be pretty trivial to save them using the "returnChanges" option of r.update() [1].
If you set returnChanges, the result will include a changes array with the old_val and new_val for each modified document. You could take the contents of old_val, give it a unique identifier, and then store it in another table with an insert().
You can add a field "last_updated", and update a document with the option {returnChanges: true}, and you'll get the new and old values.
You can then save the old values in another table like "history" where the primary key is `_id` instead of `id`.
In this case, you can do
r.table("data").get(1) // get the current version of the document with id 1
r.table("history").getAll(1, {index: "id"}) // get all the previous versions of the document with id 1. And you can order them by `last_updated` if you want too.
Ah, Ok, I stand corrected. I only used CouchDB very very briefly some years ago and was under the illusion that its document version numbering could be used to keep track of document changes. I didn't realise it was purely for MVCC purposes.
Yes, Datomic advertises that it keeps track of a document's history.
That's a fair point. I sat down and tried to sketch it on paper after posting my comment, and found there was too much domain knowledge involved to do versioning directly in the database.
My comment can be rephrased as: it's still painful to do this at the application level. For such a common task, the frameworks/libraries/software I work with don't have anything baked in to handle it.
I've started trying to use Git as an additional data backend to hold the versioning, but changing from a JSON/database storage structure to one that Git will happily diff and reconcile is itself not pretty.
Super excited for this! I have other comments to this effect, but try out RethinkDB. It has been absolutely great for us and what we're doing, and it's been rock solid in its reliability.
We're implementing real time features in our product in the next couple of months so this could not have come at a better time.
I'm really excited about RethinkDB, and am using it in my (yet-to-be-launched) startup. The new changes() support seems very interesting for my app, but for my use case I'd need several queries open per user (probably on the order of 10-100).
I missed the Q&A this afternoon, but I see lots of RethinkDB engineers on here... so, would there be severe performance implications of holding open that many cursors?
If you go over a couple thousand active changefeeds in 1.16, I recommend setting the `maxBatchSeconds` optional argument to `run` to something like 10 (the default is 0.5).
That change should significantly reduce the CPU overhead on the server if you have lots of idle changefeeds. Note that this does not affect how quickly changes get delivered - you will still get changes instantly.
The exact performance will of course depend on which queries you run, the rate of writes, etc.
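For the JavaScript driver that looks roughly like this (a sketch, with a hypothetical "events" table, an open connection `conn`, and a hypothetical handleChange callback):

r.table("events").changes()
  .run(conn, {maxBatchSeconds: 10}, function(err, cursor) {
    cursor.each(function(err, change) {
      // changes still arrive as soon as they happen; the larger batch
      // window only reduces polling overhead for idle feeds
      handleChange(change.old_val, change.new_val);
    });
  });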
This blog post states that community members are working on an integration, but is that work happening publicly? I'd be interested in following it even if it isn't ready for use.
From the FAQ: "For multiple document transactions, RethinkDB favors data consistency over high write availability. While RethinkDB is always CID, it is not always A."
What is the relation between write availability and A(tomicity)?
I think this sentence is just phrased poorly. It means to convey that there is no atomicity across multiple documents, which is unrelated to the previous sentence. I've opened an issue to fix that: https://github.com/rethinkdb/docs/issues/633. Thanks for noticing!
So, am I right in reading that there's no automatic failover? Also, when a new master is chosen, do you pick the replica with the most up-to-date log, or do you hand that decision off to the administrator?
Sounds like you need to implement Raft. Do you have any plans in that direction?
Everything looks amazing, but what is their business model? How will they fund all this development in the long term? I cannot find any commercial offerings on their website.
What are the performance characteristics of realtime push? Does the performance of inserts slow down with the number of subscriptions to change feeds? Or, is insert performance unrelated to subscriptions? Also, does the change feed only show the before/after or does it also show the query that was used to transform the data?
The idea behind the architecture was that performance should be significantly better than rolling your own infrastructure, because the database has a lot of information that userland (from the database perspective) software doesn't.
The performance of inserts might slow down slightly (a matter of microseconds in insert latency) if you create many feeds. The database has to look at each insert and figure out if it applies to any of the feeds. This code is written in optimized C++ and is very fast. We're still running benchmarks, but we're shooting for performance levels where you (as a user) might not even be able to measure the difference.
The same applies to inserts that don't affect any feeds (on a per-table basis).
Same goes for throughput -- it might slow down slightly, but we're shooting for making the slowdown barely measurable if at all.
EDIT: in clustered environments, if you're subscribed to 1000 changefeeds on machine A and a write happens on machine B, we do constant work on B to send the changes to A and then A does all the work to figure out which changefeeds need to see it. TL;DR: We don't block out other writes for time proportional to the number of feeds.
Are you doing anything clever to figure out e.g. which subset of changefeeds subscribed to a table might be interested in a given update?
Let's say I have a table containing data for many users, while each subscription only needs data for a single user. Instead of scanning through all the changefeeds, you could put subscriptions in a hashmap and figure out which ones to update in O(1) time rather than O(N) time in the number of subscriptions, per update.
Obviously this is much harder in the general case, but do you do anything along these lines?
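(For concreteness, a toy JavaScript sketch of the kind of dispatch I mean, not a claim about how RethinkDB is implemented internally:)

var feedsByUser = new Map(); // userId -> set of subscription callbacks

function subscribe(userId, callback) {
  if (!feedsByUser.has(userId)) feedsByUser.set(userId, new Set());
  feedsByUser.get(userId).add(callback);
}

function onWrite(doc) {
  var feeds = feedsByUser.get(doc.userId); // O(1) lookup instead of scanning every feed
  if (feeds) feeds.forEach(function(cb) { cb(doc); });
}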
_This code is written in optimized C++ and is very fast._
Can you elaborate?
_but we're shooting for performance levels where you (as a user) might not even be able to measure the difference._
Benchmarks are almost always skewed towards the preferred workload of the DB being tested, so the results tend to be heavily biased. How are you ensuring this isn't the case with your benchmarks?
Also what are you benchmarking against? I'd like to see one against Aerospike.
Does this help with auto-failover? Can I now do something like set up a changefeed on table_status to monitor it and then call reconfigure? Or is there more work still required?
PS: Absolutely loving RethinkDB. I've been using it as my main database since April and it's a joy to work with!
Auto-failover is hard because of some fairly deep-rooted aspects of how RethinkDB is implemented. When one of your servers dies, the RethinkDB cluster won't allow any reconfigurations until it reconnects. If the server is permanently dead, you can tell the RethinkDB cluster to go on without it; but you shouldn't do that unless the server is actually permanently dead, because the server won't be allowed to rejoin the cluster later. This makes it hard to write an automatic failover tool because you'd need to be able to tell the difference between a server that's permanently dead and a server that's just dropped its connection for a bit.
The solution is to re-architect RethinkDB so that it can reconfigure a table even if there's a disconnected server. This is a pretty big project, but we're working on it, and it will probably ship around April or May. We'll also include server-side auto failover at that time, because it's easy once this problem is solved.
RethinkDB 1.16 doesn't have auto-failover yet, but ReQL admin still helps because you can integrate it with external cluster management tools that you could use to initiate the failover.
We are actively developing the foundations for integrated auto-failover at the moment, so auto-failover is going to come soon.
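In the meantime, the monitoring half can be scripted against the new system tables. A rough JavaScript sketch (assuming an open connection `conn`; alertOps is a hypothetical notification hook, and the actual repair action is up to your external tooling):

r.db("rethinkdb").table("table_status").changes()
  .run(conn, function(err, cursor) {
    cursor.each(function(err, change) {
      var status = change.new_val;
      if (status && !status.status.all_replicas_ready) {
        // hand off to your failover tooling here; as explained above, a
        // permanently dead server has to be removed from the cluster before
        // a call like r.table(status.name).reconfigure({shards: 1, replicas: 2})
        // will be accepted
        alertOps(status.name);
      }
    });
  });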
Looking at the docs, it doesn't appear this is meant to stream to the client, but to the application server. From there you would still need to manage queues and sockets, etc. I've had to write several exchanges that do something along these lines (streaming order data to a client), and I've always accomplished it by having the code that writes to the db also push a client update to a queue.
I guess my question is: why would this be a preferred solution? It seems to run afoul of the 'one job' design goal. What am I missing?
Hi Chris, great question. There are a couple of advantages actually.
- You don't have to write logic to figure out which clients need to be updated with some data. You just run the queries you need to generate the data for the client, and RethinkDB sends you an update only if the result of that specific query changes.
- Having the thing that modifies the data send the updates to the connected clients becomes increasingly difficult if you have multiple application servers. Then you would need to set up some separate message passing / broadcasting infrastructure if you also want to update clients connected to other servers. RethinkDB takes care of "passing the news around", even in a distributed environment.
- RethinkDB supports changefeeds not just on the raw data, but also on transformations of the data (see the sketch below). Not all transformations are supported yet (for example, map-reduce queries are not), but our goal is definitely to support changefeeds on pretty much any query. Just knowing that the underlying data has changed isn't enough: in a traditional setting, you would still need to either recompute the whole query or implement your own logic for incrementally updating the results of every type of query you want to run. RethinkDB updates query results incrementally and efficiently.
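For example, something along these lines should already work (a sketch, assuming a "scores" table with a secondary index on `score` and an open connection `conn`):

// get notified whenever the top 3 scores change, without recomputing the leaderboard yourself
r.table("scores")
  .orderBy({index: r.desc("score")})
  .limit(3)
  .changes()
  .run(conn, function(err, cursor) {
    cursor.each(function(err, change) {
      console.log(change.old_val, change.new_val);
    });
  });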
As I mentioned, the code that writes the event to the db notifies the client about it, so it already knows about the client. This isn't really an issue.
In the case of an exchange I never split a single order book across multiple servers, but I can imagine a lot of applications where this could be an issue. How do you handle data consistency across nodes? Ultimately you have to solve the same issue...
Again, this isn't a big limitation for my use case. That said your answer has certainly given me a greater understanding of other circumstances where it would be very useful. Thank you.
If you shard a RethinkDB table to split it across multiple servers, and then create a changefeed on the table, the database will automatically send changes from both servers. Basically, server management/sharding in RethinkDB is visible to ops people, but is completely abstracted from the application developer. All writes are immediately consistent.
Rethink doesn't provide ACID guarantees, though. If you want to make a change to multiple documents in a table and have ACID guarantees, I'd stick with traditional RDBMSes.
How are all writes immediately consistent? What if two clients make simultaneous writes to each shard? Or more extreme, what if there is a net split and writes continue on both shards?
I suppose I could just ask if you have an architecture document floating around :)
That's a great question. We published a blog post earlier this week explaining why building realtime features into the database is so exciting: http://rethinkdb.com/blog/realtime-web/
It would be trivial to pipe the output of the stream to the client yourself. The benefit of that is that RethinkDB doesn't have to auth the clients.
This is a preferred solution because the only efficient alternative is to inspect every insert or update, manually determining which clients care about those changes.
And what happens to the fact that the data has changed if the client becomes disconnected for whatever reason? Is the fact that the data has changed committed atomically with the data? Is the query persistent across connections? Or does it end in tears?
Changefeeds are currently bound to a database connection, so they will get terminated if for example the application server goes down or there's a networking issue.
We feel like for many queries (especially the ones you find in web applications) that's not a big deal, since you can efficiently re-run them after reconnecting. In other cases it definitely matters, and we are going to add what you describe in a future release. You can follow the progress (or chime in if you like) at https://github.com/rethinkdb/rethinkdb/issues/3471 .
So obviously that would be application specific. In the case of an exchange when a client connects they get a snapshot of the whole market and then begin receiving streaming updates that took place after that. I generally prefer ascending event ids over timestamps but either approach is workable. The logic about which id to start with can be managed by the server in a socket implementation or on the client when polling. So, disconnects/reconnects for that type of application are handled fairly elegantly.
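(To make that concrete, a toy sketch of the client-side reconnect logic, with hypothetical render, applyUpdate, and fetchEventsSince helpers:)

var lastEventId = 0;

function onSnapshot(snapshot) {
  lastEventId = snapshot.lastEventId; // snapshot of the whole market on connect
  render(snapshot.orders);
}

function onEvent(event) {
  lastEventId = event.id; // ascending event ids
  applyUpdate(event);
}

function onReconnect() {
  // replay only what was missed while disconnected
  fetchEventsSince(lastEventId, function(events) {
    events.forEach(onEvent);
  });
}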
"
An alerting application is used to notify users when new content is available that matches a predefined (and usually stored) query. MarkLogic Server includes several infrastructure components that you can use to create alerting applications that have very flexible features and perform and scale to very large numbers of stored queries.
"
I am not sure how commits on multiple documents work in RethinkDB, but in MarkLogic the reverse query is executed only after the commit (multi-document or single-document) has happened (since MarkLogic is ACID).
Another system that supports something similar is Elasticsearch, with its query percolation feature.
You can get very basic realtime push in PostgreSQL with per-row triggers and NOTIFY/LISTEN (pg_notify).
The hard part is building out a changefeed that lets you synchronize on a subset of a table in a way that is performant, and making this work with joins (i.e. when you're using the database as more than just a document store).
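A minimal sketch of that basic approach, assuming the node-postgres (`pg`) client, a hypothetical `items` table, and a notification channel named `changes`:

var pg = require('pg');
var client = new pg.Client(); // connection settings come from the environment

client.connect(function(err) {
  // per-row trigger that publishes every insert/update as JSON on the "changes" channel
  client.query(
    "CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$ " +
    "BEGIN PERFORM pg_notify('changes', row_to_json(NEW)::text); RETURN NEW; END; " +
    "$$ LANGUAGE plpgsql;", function() {
      client.query(
        "CREATE TRIGGER items_notify AFTER INSERT OR UPDATE ON items " +
        "FOR EACH ROW EXECUTE PROCEDURE notify_change();", function() {
          // subscribe and handle pushed rows
          client.on('notification', function(msg) {
            console.log('changed row:', JSON.parse(msg.payload));
          });
          client.query('LISTEN changes');
        });
    });
});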
What are the plans for future official driver support? Is there a community Swift driver in development, or will that be an official driver at some point?
Looks awesome. I really want to use this on my next couple of projects.
After the 2.0 release (this February) we're going to start bringing more drivers under the official umbrella. I don't have specific plans to share, but there will definitely be more officially supported drivers.
I'm not sure about Swift. If anyone wants to give building a Swift driver a try, check out the "Contribute a driver" section at http://rethinkdb.com/docs/install-drivers/. One is pretty easy to build, and is a lot of fun!
We're generally open to this. However, it's a bit of extra work since RethinkDB doesn't currently have a pluggable storage backend API like MongoDB or MySQL do.
Yes. When you're getting the initial data you'll get a document of the form `{ new_val: data }`. When you're getting changes, the document is of the form `{ new_val: data, old_val: data }`. Note that in the former case, the `old_val` field is missing.
If you're interested in joining, you can RSVP here: http://www.meetup.com/RethinkDB-Bay-Area-Meetup-Group/events...