FoundationDB 6.0 released, featuring multi-region support

monstrado · on Nov 19, 2018

Congrats to the team on the release. Using FoundationDB has been one of the most rock solid NoSQL experiences I've ever had, and I've used a lot. After having a few months to hammer my cluster with fairly low level atomic operations, I can confidently say this thing holds up to pretty much anything you want to throw at it. Coming from the land of HBase and DynamoDB, It's ability to automatically (and intelligently) repartition data based on write throughput has been an administrative breakthrough for me.

Looking forward to additional use cases I can throw at this beast of a system.

Kudos to you guys!

jinqueeny · on Nov 20, 2018

Any chance of writing up a case study in detail to share your experience and best practice, especially the comparison with other DBs?

th1nkdifferent · on Nov 20, 2018

Could you add some details about your scale?

monstrado · on Nov 26, 2018

Hey, I've written some comments regarding the use case in the past you can see here:

https://news.ycombinator.com/item?id=18305446

ryanworl · on Nov 19, 2018

To anyone who is on the fence about putting FoundationDB into production (or at least evaluating it for their use cases), what is the number one thing you think is missing or you're worried about?

i.e.

- a SQL interface,

- pre-packaged data structure libraries,

- monitoring,

- limitations of FoundationDB itself,

- etc.

I'm working on a talk for the upcoming FoundationDB Summit and I'd love to address some real-world questions or issues people have.

dividuum · on Nov 19, 2018

One of my constant pain point with all distributed data stores is that it's really hard to find out how they behave if something breaks. Be it the network, local storage and so on. How do I find out what's wrong? Are there guides on how to fix a problem? What happens if I lost more nodes than required to automatically recover? How does backup and restore work? Any estimates on how long a restore will take? Are there failure modes that I should monitor for that might be non-obvious? This is mostly the operations side, but personally I would never use something I don't understand enough to have a good feeling of how the system works beneath the shiny surface.

And of course there's the application side: With SQL and EXPLAIN, I can usually see bottlenecks. I have a latent fear that performance with distributed systems suddenly tanks if some structure is suddenly split across nodes for example.

zzzcpan · on Nov 19, 2018

Yeah, operations is something nobody talks about.

realreality · on Nov 19, 2018

- transaction size and duration limitations. I can almost understand the limitation on large write transactions, but the same size limitation applies to read transactions. If you’re doing a large range read, you may not know whether your range will reach the 10MB limit, and thus raise an exception.

- the storage backends seem less impressive than the marketing leads you to believe. The default memory backend is obviously too limited to use in production, and the “ssd” backend turns out just to be built on top of the Btree code from SQLite. Besides that, the documentation warns against using the ssd backend on macOS. Isn’t that a bit strange, considering who owns foundationdb??

- while testing, I found that it was impossible to shrink a cluster. If you add a second storage node just to test that the distributed stuff works correctly, you can’t reduce it back to a single node without destroying the entire database and starting over. If it’s possible to run everything on one node, it should be possible to shrink a cluster back to a single node.

- the storage backends have a crazy amount of write amplification (something like 3x, according to the docs). The foundationdb folks should focus on improving the underlying storage, for instance by building on lmdb or RocksDB or something. For my toy app, I abstracted my data access to use either lmdb (for local testing) or foundationdb (for production), but I ultimately ended up just using lmdb because I didn’t want to deal with fdb’s limitations and operational unknowns.

- another weird fdb limitation: the best single threaded latency you’ll get is supposedly around 1ms for small reads. The docs suggest you can achieve much better performance by scaling the cluster and number of clients. That may be true, but some applications may want high single-threaded performance. (Something like lmdb can achieve tens of thousands of reads per second)

jared2501 · on Nov 20, 2018

On shrinking a cluster: you'll want to use the fdbcli to "exclude" nodes. Should be pretty straight forward (search the docs for the word "exclude").

On write amplification: a factor of 3x is not actually that unusual. The default RocksDB size amplification is 2x, and I've seen performant LSM trees with about 3x write amplification.

On the single threaded bottleneck: this is an inherent issue you have when you put your database over a network connection. LMDB can do 10k/100k+ reads/sec on a single thread since it's just doing syscalls. As soon as you start to need to distribute your database across more than 1 machine you start to need to parallelize you work for high throughput.

ddorian43 · on Nov 20, 2018

Scylladb/redis can also do a lot of calls with single thread/core.

ryanworl · on Nov 20, 2018

FoundationDB single-core performance is fine. From my testing on the memory engine (and the docs), you can expect 70k+ reads/second/core for small keys and values. But crucially this means you must have concurrency to drive throughput.

No database can magically make your serial access pattern faster. Amdahl's law and all that.

FoundationDB's latency for your specific workload is up to how good you are at designing your algorithm for concurrency. If you do every step serially, you'll be spending most of your time waiting for the network.

ryanworl · on Nov 19, 2018

Regarding your first comment, the reason I’m listed as a contributor for this release is I made a change to the documentation about large range reads. Basically, value sizes are not included in the 10mb limit for reads.

realreality · on Nov 19, 2018

Ah yes, I just noticed that in the docs. That’s a good thing to note, though you could still run into the problem with very large ranges (maybe reading 1 million keys is a rare use case?).

ryanworl · on Nov 19, 2018

Reading a million individual keys would be quite rare I would guess, but that isn’t really the issue for a large range. The keys at the start and the end of the range are what’s counted in that case. So if you read the range A-Z, the size is only those two keys A and Z, not the size of keys in between.

More relevant for the current storage engines (although changing in a future storage engine from digging through the code and the abstract for an upcoming talk) is the five second transaction duration limit. That’s just because the multi-version data structure only includes the last 5s of versions.

realreality · on Nov 19, 2018

Oh, that’s even less-expected behavior! In that case, one would never run into the size limitation for range reads. I think the docs should clarify that only the first and last keys count toward the transaction size.

Yes, the 5 second limit could be a problem.

ryanworl · on Nov 20, 2018

There are two reasons for this I can see:

1) All mutations and key ranges are stored locally and submitted when the transaction commits. Lots of data to transfer.

2) the optimistic conflict checker can process one range (even if it is a lot of underlying keys) a lot easier than each individual key in that range

Meai · on Nov 19, 2018

I havent tested that it does work as described but the docs say that the cluster stops working entirely unless the cluster size corresponds to the configured replication. Seen here: https://apple.github.io/foundationdb/configuration.html#conf...

Look under "double" mode or "triple" mode. Is this why it maybe didn't work for you?

realreality · on Nov 19, 2018

I’m not sure if you can switch from single to double, and then back to single. I don’t remember if this configuration was available when I was testing fdb, or if you just increased the number of processes in order to scale the cluster.

I do remember getting into a state where the status said it was migrating data, but there was no available node to migrate it to (because I wanted to shrink the cluster). Effectively, the cluster was deadlocked.

Meai · on Nov 20, 2018

Supposedly it works, as described a section below: https://apple.github.io/foundationdb/configuration.html#chan...

But yeah, the entire point of my comments in this thread is that the database should be telling me exactly what my options are at any situation if there are any. Judging by your comments it seems that you have also encountered a silent "deadlock" and the database gave no indication of what the hell was going on. That's the key here: The database silently stopped working, right? For something as critical as a database with possibly very important data, this just isn't acceptable to me. I want to be told as if I'm a complete noobie user what I have to do and why and what is going on with my data. The database is not a place where I feel the need to put on my smartypants hat, it's where I want to be taken care of completely.

ex3ndr · on Nov 19, 2018

We are actually spend two weeks rewriting our code from SQL to FoundationDB (we have built our own simple ORM).

One of the huge problems is a lack of consistency constraints. Since during development data make became broken in various ways (and we have done several times). And right now there are no way to implement simplest constraints on database level.

ryanworl · on Nov 19, 2018

Could you give an example of what you mean? This is very relevant for my talk.

ex3ndr · on Nov 19, 2018

I mean we have built a nice ORM for FoundationDB, but eventually we found several bugs in it and we just didn't write to our indexes correctly or put several indexes in the same subspace. Then we started to migrate our data (we were so crazy that moved large chunk of our production data already!) and we found that our migrators are also have problems. Basically our data was not consistent at all.

There are almost no easy way to ensure that everything is OK.

Also I have fear that we probably deleted something and there are no way to prohibit modification of certain keys.

lykr0n · on Nov 19, 2018

My blocker is that the documentation is confusing and not very straight forward. The tutorials are not very verbose and the code (at least for python) is not written in a way that is easy for someone to understand.

I'd be very interested in usage for time-series, but it looks like to get a working example up I'd need to fully parse the documentation and tutorials in order to do it as there is no "Let's build a Timeseries Database on FoundationDB from scratch"

monstrado · on Nov 19, 2018

Have you given https://apple.github.io/foundationdb/time-series.html a read? Since FoundationDB is lexicographically sorted it's pretty straight forward to keep things sorted for reading data in chronological order.

For example, if you want to read "last 5 minutes" of data, you can keep your timestamp at the end of the key in reverse order (Long.MaxValue - timestamp).

lykr0n · on Nov 20, 2018

Yeah, but that doc assumes you've read and understood everything else in the docs. That document provides no help in understanding the database, it's concepts, or how to implement them. I could try and reconcile that doc with the python tutorial and everything else, but I just want something where I can copy and paste and it works.

FoundationDB doesn't have a batteries included documentation like Clickhouse does: https://clickhouse.yandex/tutorial.html

amirouche · on Nov 19, 2018

Deployement strategies guide from single node cluster to single datacenter to multi region. And how version upgrades happens

DLA · on Nov 19, 2018

+1 for a SQL layer/interface. Would massively increase utility of FDB.

htfy96 · on Nov 19, 2018

Haven't tested yet, but I was wondering if it's possible to use TiDB as SQL layer. [0]

[0]: https://github.com/apple/foundationdb/issues/198

amirouche · on Nov 19, 2018

https://forums.foundationdb.org/t/sql-layer-in-foundationdb/...

whatshisface · on Nov 19, 2018

I remember this being talked about since before the fist FDB release. I think the only reason we don't have one yet is because SQL is not nearly as cool as everything else in Foundation. ;)

zzzcpan · on Nov 19, 2018

Have you tried CockroachDB, TiDB? Or why do you think SQL could increase utility? It would make things slow and unpredictable, something distributed databases already struggle with.

DLA · on Nov 20, 2018

I am a huge fan of CockroachDB. Using it in two projects currently and it’s been fantastic. Having a SQL interface to FDB would be the glue needed to integrate FDB into many domains as a viable solution. FDB has solid k/v interfaces now, handles clustering and failures, adding a SQL layer increases its potential utility and lowers the bar for many users. To your point about slow and unpredictable sure that can be an issue - perhaps a limited SQL interface INSERT, UPDATE, DELETE, SELECT to start. Make it dirt simple to do CRUD ops.

continuations · on Nov 20, 2018

- Secondary Index

- some way to store ordered records to speed up range query

- comparison & benchmarks of FoundationDB vs other distributed DB like TiDB, CockroachDB, ScyllaDB, etc

ryanworl · on Nov 20, 2018

Those first two are pretty trivial. Keys are already stored in sorted order in a b-tree. Secondary indexing involves putting the indexed value to the left of the primary key in the key you write.

continuations · on Nov 20, 2018

> Keys are already stored in sorted order in a b-tree

I thought FoundationDB stored data in a hash table. Didn't realize it uses a b-tree. Thanks for the clarification.

> Secondary indexing involves putting the indexed value to the left of the primary key in the key you write

By secondary index I meant having a sort order that's different from the primary key sort order.

rbranson · on Nov 20, 2018

FoundationDB provides the building blocks. Secondary indexes can be built on top of the regular primary index. Transactions make it possible to maintain a consistent 2I using a K/V interface.

tensor · on Nov 19, 2018

We use Postgresql heavily and it's not really a viable replacement. Even if there were an SQL layer, I doubt it would be fully featured enough for our needs. We may evaluate FDB as a distributed cache, but haven't had the time.

riku_iki · on Nov 19, 2018

Curious if it is possible to query fdb as fdw from psql..

ddorian43 · on Nov 20, 2018

Should be able. Yugabyte is doing something similar.

zenbai · on Nov 20, 2018

I am considering FoundationDB for a SaaS project. Here are some questions I have:

1. What is the recommended way to filter on multiple indexes?

2. What is the recommended way to filter on one or more indexes and sort on another or more index(es)

3. If I use FoundationDB as my main datastore, how should I implement full text search?

4. From what I understand, FoundationDB stores keys sorted globally. How does it handle hot spots where a range of keys that is very frequently accessed is on one machine that gets overloaded?

Edit 5. How do you do backups?

Meai · on Nov 19, 2018

To me it's several things: 1. I don't know how stable foundationdb actually is. You say it's tested and very performant but simply starting fdbmonitor on localhost with two processes and then starting fdbcli, it takes 5.6 seconds to get a connection. Typing 'status' takes even longer (before I have a database configured). Why does it take so long for me to get back an error? What could possibly justify that kind of delay? I know I'm saying the word 'should' a lot of times but in my mind it should be practically instant, especially if there is no network trip involved and you know all the servers that I configured. Secondly, the errors are scary to me. "The coordinator(s) have no record of this database. Either the coordinator addresses are incorrect, the coordination state on those machines is missing, or no database has been created." The error "The database is unavailable" is equally scary. I don't even have a database yet but sometimes I feel like I did create one and then I imagine myself getting that error in production because of some simple or not so simple mistake and I think:

What would I even be doing if the coordinator state is missing? That sounds like a critical internal issue to me, is this really something I need to find out and if it is, why does the error not tell me how to check this or repair it?

Another thing: Right after I create a database with 'configure new single memory', I get in 'status': " Moving data - unknown (initializing) Sum of key-value sizes - unknown " Is this really such an edge case that the database cannot give me meaningful information on what is going on with my data right now? That is very scary to me. It should be able to tell me what is going on with the data, this is the simplest possible moment. I just created a database 5 seconds earlier and I run a status command. Now the database is apparently in an unknown state. What I would like to get is a realtime progress bar of where data is moving and why, with percentage numbers indicating the progress and if there is any error it should actually tell me the location of the log file.

The other big question mark is: I dont know if I can truly configure everything correctly so I actually get the performance that is advertised, there seem to be a lot of things I need to consider and configure exactly the right way in order to get the correct performance as I've read on the forums, from the number of processes to start, to all kinds of other configuration issues. Somebody actually made a guide and I'm thankful for this but in my opinion I should not need to worry about any of this, foundationdb should scan what CPUs/disks I have and suggest the optimal solution/configuration. These are simple rules, shouldnt they be automatically done? Imagine how many people are running foundationdb suboptimally just because they didnt know that the default naive way of starting a few processes and letting them do some work is horribly wrong.

The third issue is: What matters is how the performance will look after I did all my fancy custom layers on top and when I had to emulate all the stuff that was builtin by mongodb and whether this thing is actually as reliable and performant as it says it is. When I look at the issues I enumerated in the beginning, it really doesn't feel good to me.

whateveracct · on Nov 19, 2018

> - monitoring,

What is the state of production monitoring of FoundationDB?

ryanworl · on Nov 19, 2018

FoundationDB exposes extremely detailed XML logs, as well as providing a status key you can scrape (\xff\xff/status/json) which exposes high level metrics about the entire cluster, including health, disk space, transaction rate, conflict rate, etc.

whateveracct · on Nov 19, 2018

Thanks for the direction. I'll read up on it! FoundationDB seems very cool :)

amirouche · on Nov 19, 2018

Here is the relevant thread in the forum https://forums.foundationdb.org/t/what-do-you-monitor/184?u=...

Redoubts · on Nov 19, 2018

It feels like the 80% Lower Receiver of databases, so more layers like an SQL layer or something similarly powerful would be ideal.

qaq · on Nov 19, 2018

SQL interface

Nullabillity · on Nov 20, 2018

Trust. Apple already killed it once, just after they bought it. How am I supposed to trust that they wouldn't do so again?

A SQL interface might lessen the impact of that somewhat, but then I might as well just use Postgres from the start.

statictype · on Nov 20, 2018

It's open source now right?

Nullabillity · on Nov 20, 2018

Sure. That doesn't mean they couldn't do a lot of damage by taking down the documentation, official repo, and download mirrors.

Again.

Rafuino · on Nov 19, 2018

Shameless plug here, but if anyone wants to benchmark in-memory vs. NVMe NAND SSD vs. NVMe Intel Optane DC SSD performance, we're looking for someone with FoundationDB expertise to give it a shot and share their learnings with the community. Make a request for a server by posting a new issue at our Github page [1].

Basically, I'm curious to know how FDB's memory engine performs compared to the SSD engine with a standard NAND SSD and an Intel Optane DC SSD. Something along the lines of the throughput per core and latency results on the FDB performance page [2].

[1]: https://github.com/AccelerateWithOptane/lab/issues [2]: https://apple.github.io/foundationdb/performance.html

Disclosure: I'm working at Intel and help manage our open source lab with our friends at Packet.

whitepoplar · on Nov 19, 2018

Only tangentially related, but are there any public Postgres benchmarks for systems running high-CPU, high-memory, p4800x optane drives?

Rafuino · on Nov 19, 2018

Would something like the TSBS [1] help with this? It's TimescaleDB but they're built on Postgres. They have built-in high-CPU queries, but I haven't seen high-memory before. Can you point me in the right direction? Otherwise, we've had some Postgres people use the lab and are waiting on their decision whether to share publicly.

[1]: https://github.com/timescale/tsbs

throwaway5752 · on Nov 19, 2018

Anyone know how they're using it at Apple vs other distributed databases?

jared2501 · on Nov 20, 2018

A mate of mine is ex-apple and said they use it for icloud.

throwaway5752 · on Nov 21, 2018

Yes, but I understand they use Cassandra heavily https://www.techrepublic.com/article/apples-secret-nosql-sau...), and I was curious about why they use FoundationDB in some settings vs Cassandra in others. I can imagine a few good technical reasons to use one or the other depending on the context, but figured I'd ask in case anyone knew.

the_duke · on Nov 19, 2018

So, FoundationDB is a pretty low level distributed key-value store with transactions.

Most applications will need something higher level, like a SQL or document db frontend which could be built on top.

I'm curios what people have started using FoundationDB for. Any interesting stories to share?

ex3ndr · on Nov 19, 2018

We have moved to FDB for our messaging platform. We had several options:

a) Rewrite SQL code. In our case we are using Node.JS and all SQL libraries are very very slow. Even replacing one with another is enormous work. b) Rewrite to a new language. It was also an option since querying Postgres can take 1ms, but parsing response can easily take 100ms+. That trashed our event loop and causes awful latency. c) Rewrite to high performance NoSQL database.

We picked a last one. In context of a Node.js we were able to write really thin layer on top of FDB that works super-fast and in a way we needed.

In my previous startup we eventually ditched all SQL from our codebase too since SQL databases is just too slow for low latency messaging apps. There are no simple way to shard data, there are always random locks around your database (which blocks connections). Locks are really hard to debug sometimes. How to scale single SQL server? All of this is doable, but in FDB it was basically free.

We migrated to FDB and got almost x100 improvement in latency/performance. And unlike SQL-code that was very carefully crafted we can do nasty things. Like - "hey, let's just pull this key every 100ms and check for a new value" or "hey, let's do it on 10s of instances at the same time?". In this situations Postgres started to consume all available CPU. You can easily creep out SQL with a single instance of your app. We haven't managed this to do with FDB for 1/2 of the cost. We are often in situation when someone commit something with a bug and, for example, started to pull data every millisecond in N^2 streams where N is number of online users. In this situations we just can't see any impact at all on our platform. Just spikes in monitoring.

FDB is wonderful thing - it allows you to forget about optimizing performance of your queries, forget about managing backups and replication. It just works!

tshannon · on Nov 19, 2018

100ms+!

Ouch. Is that just large amounts of data? Are you guys using pg.native?

ex3ndr · on Nov 19, 2018

No, this is just say a list of messages, not that much of data. We tried to move to pg-native, but it didn't help. Problem was in Sequelize. But in my internal tests even Sequelize was the fastest library on the market.

th1nkdifferent · on Nov 20, 2018

Looks like you’re blaming Postgres for Something that sounds like Sequelize’s fault. You should try prototyping parts of your application in a language that is better supported. Last I used Sequelize I was disappointed at how poorly it fared compared to other libraries like Django ORM or SQLAlchemy.

amirouche · on Nov 19, 2018

> like a SQL

Most people don't mind SQL. RDBMS are good enough. People don't want to learn a new thing that's why it's still here and that's why FDB will have a hard time to compete with CockroachDB or TiDB or Spanner.

> document DB frontend which could be built on top

A document database is a low hanging fruit in FDB (except if you want 1-to-1 mapping with MongoDB). But my advice is to stay away from MongoDB API.

What FDB is offering is much more powerful. Think the ease of use of MongoDB with the power of dynamodb + other niceties likes 'watches' (pgsql-like notify), transactions of course and your favorite language to do the queries. Also, FDB works in standalone mode that is you can start using FDB on day one and grow your business with it. All you need is a good layer.

> Any interesting stories to share?

I have been dabbling with key-value stores for 5 years now. So I am definitely biased. Simply said, key-value stores open perspectives you can not imagine, FDB, in particular, is a great great idea and it looks like based on the forums interactions that it's a good (if not fabulous) piece of software.

amirouche · on Nov 19, 2018

Read on the forums for more (success) stories on forum https://forums.foundationdb.org/ tl;dr: mostly timeseries, but (also there is long thread about a distributed task queue using FDB but I am not sure it's in production, yet)

ex3ndr · on Nov 19, 2018

If someone want to chat about FoundationDB and want to ask about our (while limited) experience building messaging on top of FDB, please feel free to join our small room: https://app.openland.com/joinChannel/updnSlD

romed · on Nov 20, 2018

The upgrade instructions are firmly in the lolwut category. The database is just not available during the transition. I think they should focus on fixing their protocol such that multiple versions can run at the same time, and rollbacks are possible.

discoball · on Nov 20, 2018

How does FDB compare to Spanner as far as the Consistency model and trade offs?

ryanworl · on Nov 20, 2018

FoundationDB and Spanner both offer external consistency. Spanner does this through synchronized clocks. FoundationDB has a similar clock called TimeKeeper, which is not a clock per se but a counter which advances approximately 1M times per second. Transactions are ordered based on this timestamp.

discoball · on Nov 20, 2018

With a Lamport Clock (counter, logical clock) you could end up with the following due to dependence on conflict resolution aka optimistic MVCC rather than Wall Time (copies do from another HN):

“it is possible for a transaction C to be aborted because it conflicts with another transaction B, but transaction B is also aborted because it conflicts with A (on another resolver), so C "could have" been committed. When Alec Grieser was an intern at FoundationDB he did some simulations showing that in horrible worst cases this inaccuracy could significantly hurt performance. But in practice I don't think there have been a lot of complaints about it.”

ryanworl · on Nov 20, 2018

Yes, I think is fairly well known in the optimistic family of concurrency control algorithms you can get into situations where aborts are not necessary.

https://db.cs.cmu.edu/papers/2016/yu-sigmod2016.pdf

Section 3.4 of this paper covers another example of this.

thefounder · on Nov 20, 2018

Is there any high level api/lib like we have for Google Datastore?