Introducing ReadySet (readyset.io)
190 points by alanamarzoev4 on April 5, 2022 | 38 comments


I've used both of the suggested methods under "Current standards for scaling out databases" so I see where this is coming from. But I peeked at the AWS reference architecture, and it places a Consul and ReadySet deployment in my environment for me to run and maintain. I feel like any sales pitch for this really needs to convince me that having these things in my environments is going to be worth the hassle in terms of milliseconds and dollars, as opposed to just using RDS read replicas and paying a bit more. Then again, I can see this being an obvious choice if you're growing very quickly or have tight latency requirements.

With that said, it looks like cool tech, and I've read Jon's Rust for Rustaceans, which serves as a stamp of quality for this even if I haven't tried it yet!


Digging down a couple layers of links from this, the underlying paper, "Partial State in Dataflow-Based Materialized Views" https://jon.thesquareplanet.com/papers/phd-thesis.pdf is pretty intriguing. It sounds like a potential free lunch in specific performance areas, which means it also sounds too good to be true, but if it turns out to be a metaphorical 90%-off lunch that's still very promising.


Oh hey, that's my thesis! Happy to answer any questions you may have about it :) There's also the OSDI'18 paper here which may be of interest: https://jon.tsp.io/papers/osdi18-noria.pdf


I haven't read much about Noria other than this readme [0], but I'd like to know: are you familiar enough with Materialize [1] to contrast it with Noria in terms of performance, overhead, approach, and fundamental (design) principles?

[0] https://github.com/mit-pdos/noria

[1] https://github.com/MaterializeInc/materialize


Back when we announced Materialize we got the same question in reverse! You can read my response from a few years back here: https://news.ycombinator.com/item?id=22362301

Unfortunately I'm not privy to whatever improvements ReadySet has made in the past two years, so I can't comment on differences between ReadySet and Materialize. Perhaps Jon can, though!


Insightful. Thanks!


This is super exciting. Ever since I talked to you about Noria I've been telling people about this concept. I'm excited to see a production-ready implementation of it.


Big fan here!

I've been following this space for a while now, and I must say it's exciting. To me this is the future of apps: the truth lives server-side and everything reacts from there, with partial state evaluation keeping resource consumption to a minimum.

Kafka Streams and Apache Flink seem to be focused on real-time analytics, and I wish they'd move into this space too and stimulate it.

Are you affiliated with ReadySet?


I'm pretty excited about it too! I remember when I initially started the research I was amazed that this didn't already exist.

Some context: https://twitter.com/jonhoo/status/1511401461669720068

Basically, I co-founded the company around the time I graduated, but had had my fill of database research after six years of PhD. So I joined AWS to work on Rust while Alana (the CEO) took on leading ReadySet.


According to the linked article, Jon appears to be one of the co-founders.


The remaining 10% of the 90%-off free lunch is pretty much just eventual consistency - it can occasionally be the case that you write something to the DB, and an immediate subsequent read doesn't see it. That said, there are escape hatches (we'll proxy queries that happen inside a transaction to the upstream MySQL/PostgreSQL database, and there's an experimental implementation of opt-in read-your-writes consistency), and I'd wager that the vast majority of "traditional" web applications can tolerate slightly stale reads.

Our official docs also have an aptly titled "What's the catch?" section: https://docs.readyset.io/concepts/overview#whats-the-catch
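
To make that concrete, here's a minimal sketch from the application's side. Since ReadySet speaks the standard wire protocol, an ordinary driver works unchanged; the host, port, and table names below are hypothetical placeholders:

    import psycopg2  # any standard PostgreSQL driver works

    conn = psycopg2.connect(
        host="readyset.example.internal",  # hypothetical ReadySet endpoint
        port=5433,
        dbname="app", user="app", password="secret",
    )

    # A plain read is eligible to be served from the cache, so it may
    # return slightly stale results.
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SELECT id, title FROM stories ORDER BY score DESC LIMIT 30")
        top_stories = cur.fetchall()

    # Statements inside an explicit transaction get proxied to the
    # upstream database, so they see fully consistent state.
    conn.autocommit = False
    with conn.cursor() as cur:
        cur.execute("UPDATE stories SET score = score + 1 WHERE id = %s", (42,))
        cur.execute("SELECT score FROM stories WHERE id = %s", (42,))
        fresh_score = cur.fetchone()
    conn.commit()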


This is similar to an asynchronously replicated database replica,

which means reads might return stale data because of replication lag.

It will also increase the load on the server you read the replication log from.

But the primary database could dump the transaction log to S3/Kafka and have the ReadySet instance read it from there instead of directly from the primary database.

So for a read-mostly website (Hacker News/Reddit) this is indeed a free lunch.
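
A rough sketch of that decoupled ingestion, assuming change events land on a Kafka topic as JSON (the topic name and event shape are made up, not ReadySet's actual format):

    # Consume database change events from Kafka instead of tailing the
    # primary directly; the cache applies each change to its local state.
    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "db.changelog",                            # hypothetical CDC topic
        bootstrap_servers=["kafka.internal:9092"],
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    cache = {}  # stand-in for the cache's materialized state

    for message in consumer:
        change = message.value
        if change["op"] in ("insert", "update"):
            cache[change["key"]] = change["row"]
        elif change["op"] == "delete":
            cache.pop(change["key"], None)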


I am really excited to see where ReadySet can take this technology. If deployment is as simple as they say, this sounds like an instant win for any high-throughput service.

I am curious how they handle queries that would overflow local main memory: if I just have a PK lookup on a 10 TB table, you obviously can't store all of that in RAM, and you'd still need to do some form of cache eviction and invalidation.


The trick is "partial view materialization" (https://jon.thesquareplanet.com/papers/phd-thesis.pdf). Basically, you only materialize results for commonly-accessed keys, and compute other keys on-demand.
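
A toy sketch of the idea, where query_upstream() is a hypothetical stand-in for the dataflow "upquery" that computes a missing key against the source of truth:

    from collections import OrderedDict

    CAPACITY = 10_000
    materialized = OrderedDict()  # key -> result rows, in LRU order

    def query_upstream(key):
        # Stand-in: evaluate the query for this one key upstream.
        return [("row-for", key)]

    def read(key):
        if key in materialized:
            materialized.move_to_end(key)     # refresh LRU position
            return materialized[key]
        rows = query_upstream(key)            # miss: compute on demand
        materialized[key] = rows
        if len(materialized) > CAPACITY:
            materialized.popitem(last=False)  # evict the coldest key
        return rows

    def on_write(key, new_rows):
        # Only keys that are materialized carry state; writes to
        # unqueried keys cost nothing to maintain.
        if key in materialized:
            materialized[key] = new_rows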


Is there a way to federate which keys are commonly accessed? For example, if I commonly access the entire table, can I direct inbound traffic to different application servers and have them hit different caches, so each cache pulls in only a subset of the data and doesn't have to worry about things like which keys are being written globally?


We've thought about that, actually! We have an experimental mode where multiple copies of the same query can be created (actually just multiple copies of the leaf node in the dataflow graph, so intermediate state is reused) with different subsets of keys materialized. The idea is that these separate readers would then run in different regions, so e.g. the reader in the EU region gets keys for EU users, and the reader in the NA region gets keys for NA users.
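
In application terms, the routing could be as simple as this sketch (the endpoints and region mapping are assumptions for illustration, not ReadySet API):

    # Each regional reader materializes only the keys its local users touch,
    # while intermediate dataflow state is shared upstream.
    READERS = {
        "eu": "readyset-eu.example.internal",
        "na": "readyset-na.example.internal",
    }

    def reader_for(region: str) -> str:
        # Fall back to NA for regions without a dedicated reader.
        return READERS.get(region, READERS["na"])

    assert reader_for("eu") == "readyset-eu.example.internal"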


This was recently on my radar for a thing I wanted to do, but what you have might also fill the same gap.

https://materialize.com/


Materialize is advertised more for derived data, i.e. live-updated materialized views targeted at analytics - in other words, writing streaming ETL as SQL. That said, I've been interested in using it as part of the OLTP stack, and it seems like that's the angle ReadySet is going for.


I can't think of real-world examples of apps with a read-path scaling problem that this tech would solve well. It would be great if the authors could catalogue a bunch of real-world use cases (real customer case studies would be even better).

Today, machines are huge (in terms of compute cores, memory, and IOPS for storage and network), and a single MySQL or PostgreSQL database can do a lot of work. This makes it much, much easier to build apps that don't have Internet-scale user counts – which is pretty much all enterprise apps – without resorting to distributed databases.

In Internet-scale consumer app domains like e-commerce/delivery or fintech, where relational databases are used heavily, most queries have strict correctness requirements and won't tolerate staleness. Also, most query results are highly specific to each user and won't get many cache hits. And apps in general are increasingly personalised, with fast-changing content.

In terms of technology evolution, I see people moving from single large-machine databases to distributed SQL databases as their use cases scale.

And as distributed SQL databases mature, I expect they will gain built-in support for user-defined materialised views, with the flexibility to manage their placement – the class and number of machines that compute and serve them, and so on.


My company is building a realtime analytics service for security. You have a lot of reads that can often be answered with stale data (thanks to our data modeling, which provides ACID 2.0 semantics: associative, commutative, idempotent, distributed).

I think a lot of applications could benefit from this if they were built with a data model that leverages it properly. But if you migrate code that relies on ACID transactions 1:1 over to something with strong eventual consistency... yeah, that's gonna be a bad time.
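
A tiny illustration (not our actual system) of why ACID 2.0-style updates tolerate stale and replayed data: the merge is associative, commutative, and idempotent, so order and duplicates don't matter.

    # A toy last-writer-wins register. Reordered or duplicated events
    # converge to the same final state.
    def merge(state, event):
        key, ts, value = event["key"], event["ts"], event["value"]
        current = state.get(key)
        if current is None or ts > current[0]:
            state[key] = (ts, value)
        return state

    events = [
        {"key": "alert:1", "ts": 2, "value": "closed"},
        {"key": "alert:1", "ts": 1, "value": "open"},    # arrives late
        {"key": "alert:1", "ts": 2, "value": "closed"},  # duplicate delivery
    ]

    state = {}
    for e in events:
        merge(state, e)
    assert state["alert:1"] == (2, "closed")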


This seems like it could be a good fit for my team (and we have been discussing this kind of caching).

We frequently (~ once per second) run queries over relations that are increasing at a rate of ~100 rows per second (append only, no updates).

Could this cause any performance concerns for ReadySet? How much control do we have over the frequency of reconstruction of cached data based on the flow graph?


This feels like the sort of thing where it works great 90% of the time, but as soon as you want to do something more complicated or nontrivial it doesn't handle it properly, and it's impossible to debug or improve since you're using an opaque solution.

At least with scaling replicas or having a dumb cache layer it’s easy to understand the system.


I was prepared to be super skeptical here, but this actually looks like it could be really good, without all the gotchas I was expecting. I think the only potential hole I see is if you do a lot of DB-side code as stored procedures and whatnot; I can't tell from the write-up if they can keep those consistent.


Hi Jon Gjengset, I am so happy to see an attempt to make Noria usable in legacy applications. Am I right in assuming this needs to consume the replication log from the primary database? Or will write requests going through the ReadySet proxy produce their own change feed?


Hi, ReadySet engineer here. It does replicate using the replication log.


Logical decoding is the untapped frontier of PostgreSQL, IMHO.


Oh, this is like Litestream, which featured in Tailscale's post "A database for 2022" discussed recently [1]. In both cases you're replicating by tailing the database's WAL - in this case PostgreSQL/MySQL's, and in Litestream's case SQLite's. It's a great idea, and I'm glad we're starting to take advantage of the opportunity afforded by the design convergence on using a WAL for database transactions.

[1]: https://news.ycombinator.com/item?id=30883015
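
For a flavor of what tailing the Postgres WAL looks like, here's a minimal logical-decoding sketch with psycopg2 (the slot and connection details are assumptions; ReadySet's actual ingestion is internal):

    # Assumes a replication slot was created beforehand, e.g.:
    #   SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        "dbname=app user=replicator host=primary.example.internal",
        connection_factory=psycopg2.extras.LogicalReplicationConnection,
    )
    cur = conn.cursor()
    cur.start_replication(slot_name="demo_slot", decode=True)

    def on_change(msg):
        # Each message is one decoded change (insert/update/delete);
        # a cache would incrementally update its dataflow from these.
        print(msg.payload)
        msg.cursor.send_feedback(flush_lsn=msg.data_start)  # ack progress

    cur.consume_stream(on_change)  # blocks, streaming changes as they commit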


I am wondering how the financial upside from this is set up. Jon obviously wrote the paper and core technology behind it, and I am hoping he can do well from it.


What are the biggest differences compared with something like materialize.com?

I love the way you explain what ReadySet is. Congrats on the launch.


> However, if your million-dollar idea ends up being worth even $100K, you’ll likely find your database struggling to keep up with the scale.

Huh? At a 3-year multiple, that's less than $3,000 in monthly revenue ($100K / 36 months ≈ $2,800/month). I'm sure a SaaS can provide way more value than $3k/month before you start running into scaling issues.


Yeah, that's a statement I find hard to agree with. We have 10k customers and our application is really data-heavy - our Postgres DB is a few TB in size - yet we never hit any scaling problems. Maybe the key was to use a managed cloud SQL offering, but you should do that anyway as an early-stage startup, because running databases is not part of your business.


There are lots of database proxies and caches like ProxySQL or Heimdall Data, and IMDGs like Apache Ignite - but it's interesting to see this offered as a cloud service backed by this cache-invalidation tech.

Had a similar idea a few years back and it's nice to see it turn real. Congrats on the launch.


Very cool tech here, excited for the future. Congrats on fundraising!


Is it open source?


Sounds great, until you realize you are switching to eventual consistency for your whole application.


I am wondering if one could set up ReadySet on top of Vitess / PlanetScale.


Could you explain the GDPR use case? That seems more like a use case for data residency, not data retrieval?


Can this compile to WebAssembly and run in the browser, in front of SQLite?



