
Article author here.

Instrumenting would have been 100% just as feasible. Ironically it would have been more work.

For context, our DB is highly optimised for ingestion (millions of rows / second), and adding high-resolution metrics there would impact performance, so they would have to be either ripped out afterwards or engineered very carefully (read: "not cheap") so as not to slow ingestion down.

This stuff took an afternoon, is reusable, and frankly, was more fun to implement :-).

I suspect there are tools out there that do this stuff, of course. The question is still whether finding and learning how to use them beats writing a few hundred lines of code.


Maven - see the rust-maven-plugin we wrote for this. It's open source.


The rust-maven-plugin we wrote indeed also supports JNA for these simple use cases.

https://github.com/questdb/rust-maven-plugin

Compared with JNA, JNI is indeed more complex, but it's faster and has more features. It also solves the problem of calling Java from Rust.
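
To make that concrete, here's roughly what the Rust side of a JNI binding looks like (a minimal sketch using the jni crate; the class and method names are made up for illustration, and exact signatures vary slightly between crate versions):

    use jni::objects::JClass;
    use jni::sys::jlong;
    use jni::JNIEnv;

    // Rust counterpart of a hypothetical Java declaration:
    //   package com.example;
    //   public class Counter { public static native long add(long a, long b); }
    #[no_mangle]
    pub extern "system" fn Java_com_example_Counter_add(
        _env: JNIEnv,
        _class: JClass,
        a: jlong,
        b: jlong,
    ) -> jlong {
        a + b
    }

The exported symbol name encodes the Java package, class and method, which is how the JVM finds the native implementation once the library is loaded.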


We have an initialisation step as soon as we load the JNI lib that takes care of this. Given that this gets done before any other threads are started, I don't think there'd be an issue. Good point :-)


I'm the author of the blog post.

The focus of the article really is JNI in Rust.

I see most questions are about "Why did you not use X language instead?", so let me try and address this.

To answer the "Why not just Rust", I should first mention that Rust was still in its early days (before 1.0), and it was a risky bet to choose an emerging language.

The project was started by Vlad (our CEO) who had a background writing high performance Java in the trading space. The Zero-GC techniques - whilst uncommon in open source software - are mature and a staple of writing high performance code in the financial industry. The product evolved organically, feature after feature.

I personally joined the team from a C and C++ background, having previously moved from a project that suffered from minute-long compile times from single .cpp files due to template overuse. Whilst I do miss how expressive high-level C++ can be, Java has really good tooling support.

When writing systems-style software, most of what matters for performance is how we make system calls, manage memory, and debug and profile. This is an area where Java really shines. Don't get me wrong: in the absolute sense I think C++ tools tend to be better (Linux perf is awesome!), but Java tooling is _there_. IntelliJ makes it trivially easy to run a test under the debugger reliably and consistently. It's equally easy to run a profiler and to get code coverage. The same tools work across all platforms too, might I add. It's not necessarily better, but it's easier. So while it may seem a little quaint, using Java has turned out to be a pretty good choice in practice.

Times have moved on. The Rust community really cares about tooling, and it's one of the reasons why we've picked it over expanding our existing C++ codebase: We just want to get stuff done and have enough time left in our dev cycle to properly debug and profile our code.


Thanks for the interesting post. Do you plan to maybe use JEP 442 in the future? https://openjdk.org/jeps/442


When the time is right. There are finally new APIs coming in the Java space that will make native-code interop easier and more reliable.

Our open source database edition can also be used embedded, though, so we can only upgrade at the pace of our customers; because of that we are still compatible all the way down to Java 8.

Were it not for this detail, we'd probably consider it a lot sooner.


Do you know RavenDB? It's a document DB almost entirely written in C#. It's incredible how fast it can get, but as you said, it does not look like typical LOB code at all.


Do you have any background reading for high performance Java or how it's used in the finance world? I had no idea it was used in this niche. As a crufty Java dev who types `new` everywhere and never gives thought to GC, squeezing out high performance sounds like an interesting side of the language.


Maybe we should write a blog post, though there's gotta be one out there for this already.

The short of it is: Learn C. Learn your system calls. Learn JNI. Learn about com.sun.misc.Unsafe. Learn about the disruptor pattern. Learn how to pool objects. The long type can pack a lot of data. Go from there!


Check out Peter Lawrey's blog [1]; it has lots of excellent content coming from a high performance trading background. There's also some older (probably dated) content on Martin Thompson's Mechanical Sympathy blog [2].

[1] http://blog.vanillajava.blog/

[2] https://mechanical-sympathy.blogspot.com/


Why did you not consider leveraging Java's recent Foreign Function and Memory API?


One of our distribution channels is Maven Central, where we ship a Java 11 compatible library. Embedded users preclude us from leveraging the latest Java features.


FYI, even in the most recent Java LTS release (21) this is still flagged as a preview feature, so you're unlikely to see production applications using it yet.


So Rust combines the expressiveness of C++ with the ease of development of Java?


I'd say expressiveness of C++ with productivity of Java. Rust is indeed not easy to learn.

A little example. Yesterday I was making changes to our C++ client library and I wanted to improve the example in our documentation.

We use a dedicated protocol called ILP for streaming data ingestion and each of the inserted rows has a designated timestamp.

In the Rust example, using the added support for chrono::DateTime, it was trivially easy for me to add a timestamp for a specific example date and time: Utc.with_ymd_and_hms(1997, 7, 4, 4, 56, 55).

Our C++ library instead takes an std::chrono::time_point. I wanted to use the same datetime. As far as I can tell, it requires first going through the old C "struct tm" type (which is local and not UTC), then converting to "time_t", then converting to UTC via gmtime, and then constructing a time_point from that. After 10 minutes the code got too long and complicated, so I just substituted a timestamp specified as an int64_t in nanoseconds.
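
For comparison, here's roughly what the Rust side looks like (a sketch assuming chrono 0.4; the printing is just for illustration):

    use chrono::{DateTime, TimeZone, Utc};

    fn main() {
        // Build the example UTC timestamp in a single call; with_ymd_and_hms
        // returns a LocalResult, so unwrap the single valid value.
        let ts: DateTime<Utc> = Utc.with_ymd_and_hms(1997, 7, 4, 4, 56, 55).unwrap();
        println!("{} ({} s since the epoch)", ts, ts.timestamp());
    }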

Don't get me wrong, the C++ time_point is a work of art in how flexible it is, but unnecessarily complicated in most cases.

I should add that I also spent 45 minutes yesterday debugging a CMake issue.

Rust is not easy to learn, but it's just more modern and productive.

C++ is still great if you've got a massive team, but at our scale I don't think it makes any sense.


Rust development is Java-esque in some ways, I think that is a fair characterization, but the Rust language is noticeably less expressive than C++. The relative lack of expressiveness has been a stumbling block for use in some domains, because modern C++ implementations require a fraction of the code to do the same thing. This disparity doesn't show up for all types of code, so it is not uncommon to see both Rust and C++ used in the same org depending on what the code is trying to do. They have different strengths.


What makes you say that development in Rust is Java-esque? Genuinely curious!


Java and Rust have similar goals; you can see it in the design of the language and the ecosystem. I lived in the early Java ecosystem, and the Rust ecosystem has a similar vibe. They were both attacking the same problem but are products of their respective times. The key difference is that Rust was able to learn from Java's mistakes and make ambitious technical bets that would not have been feasible at the time Java was designed. Java was invented when most serious applications were written in C and sometimes early versions of C++. It eliminated much of the conceptual complexity that made it difficult for all but the best developers to be productive in C or C++. In this, Java was a massive success: it was easy to scale development even if you didn't have the world's best engineers.

Java's mistake is that they went much too far when they nerfed the language. For highly skilled developers that could write robust C or C++, the poor expressiveness of Java made many easy things difficult or impossible. It was clearly a language designed with business logic in mind, any systems-y software was an afterthought. The release of C++11 ushered in the era of "modern" C++, killing Java's momentum in the systems-y software space.

Rust, in my view, attempts to solve the same abstract problem as Java -- we will never produce enough developers that are competent at writing C or C++. Rust talks about "safety by design" in the same way Java did when it was first released. However, it does so without being so limited that highly skilled software engineers will find the language unusable or giving up so much runtime performance that the operational economics are poor. In my mental taxonomy, Rust is pretty close to what Java intended to be but then never quite delivered on.


As someone who's familiar with Rust, but not at all with C++, can you elaborate on what situations Rust is less expressive than C++ in?

My naïve guess is high performance data structures with reference cycles?


I'd say both Rust and C++ trade blows when it comes to expressiveness. You know Rust already, so I'm not going to try to sell you on how powerful macros can be (see SQLx's ability to compile-time check SQL queries).

And indeed C++ templates are a lot more like Rust macros than Rust generics: They're turing-complete.

Combined with some interesting language choices like SFINAE (substitution failure is not an error), you end up with the ability to specialize functions, methods and whole classes in C++.

You can also have functions that return different types.

C++ templates work like duck-typing within a static language: in Rust you need to say what traits your generics need to support, whereas in C++ the compiler will try to substitute, and if a substitution fails (say, because T doesn't support the required methods) it will try another candidate until none are left.

If none of the substitutions work, you will be shown ALL of them in the error reporting: this is what leads to pages and pages of compile errors from single-character typos in C++.

Templates are really cool, but also pretty confusing when reading code since you're in a guessing game of what types will fit the constraints imposed by the _implementation_ of the function.

From C++20 there are concepts to make templates work a little more smoothly.

There have been whole books written about how to abuse templates; they are prerequisite knowledge when working in large C++ codebases.
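
For contrast with the C++ substitution model described above, here's a minimal Rust sketch of the trait-bound approach, where the constraint is spelled out in the signature rather than implied by the implementation:

    use std::fmt::Display;

    // The bound is part of the signature: callers know up front what T must support.
    fn describe<T: Display>(value: T) -> String {
        format!("value = {value}")
    }

    fn main() {
        println!("{}", describe(42));
        println!("{}", describe("hello"));
        // describe(vec![1, 2, 3]); // error[E0277]: Vec<i32> doesn't implement Display
    }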


The big one is metaprogramming. Most people that have never really used it don't grok how powerful (and clean and maintainable) it has become in recent versions of C++. I work on a few different C++20 code bases, and the amount of code that is no longer written because it is generated at compile-time with rigorous type safety is brilliant. It goes well beyond vanilla templating; you can essentially build a DSL for the application domain.

Another one, with a more limited audience, is data models where object ownership and lifetimes are inherently indeterminate at compile-time. Because C++ allows you to design your own safety models (they are opt-in and not built into the compiler), you can provide traditional ownership semantics (e.g. the equivalent of std::unique_ptr) without exposing the mechanics of how ownership or lifetimes are resolved at runtime. Metaprogramming plays a significant role in making this transparent.

Those are the two that matter the most for my purposes. They save an enormous amount of code and bugs. Rust has a litany of other gaps (lack of proper thread locals, placement new, et al) but I don't run into those cases routinely.

The data structure thing you mention would be annoying but to be honest I rarely design data structures like this. For performance, most data structures tend to rely on clever abuse of arrays.


I wouldn’t call Rust easy to develop in.


I would as well. I've also placed in the top 100 of advent of code before... using Rust :)

I think once you get familiar with it, it is just slightly slower to write than python.


I agree with this statement if my data structures map cleanly to Rust's preferred single-ownership, which most problems in my domain do.

Sometimes I do run across problems that are difficult to express in Rust without resorting to interior mutability, and it can slow me down to figure out the best way to model my data.


Definitely agree - when I say slightly slower I'm mostly referring to the happy-path/basic uses (reading files, using hashmaps, web servers, etc).

There are definitely aspects of Rust that are much more complex (typically with a tradeoff of more expressiveness, but not always), but at least in my experience, these are usually areas where you can't easily express the same thing in Python. I think many times people forget that you can frequently `clone` your way out of many issues if you are trying to move fast.
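
As a tiny illustrative sketch (hypothetical names), cloning is often the quickest way to satisfy the borrow checker when two places want the same data:

    use std::thread;

    fn main() {
        let tags = vec!["sensor-a".to_string(), "sensor-b".to_string()];

        // Hand the spawned thread its own copy instead of wrestling with lifetimes.
        let tags_for_worker = tags.clone();
        let handle = thread::spawn(move || {
            println!("worker sees {} tags", tags_for_worker.len());
        });

        // The original is still usable here because we cloned.
        println!("main still has {} tags", tags.len());
        handle.join().unwrap();
    }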

And of course there are areas where I still only use Python: manipulating tabular data, making graphs, quick scripts for interacting with APIs, etc.


There are idioms you can use where instead of references you use indexes into a Vec or other container. This is normal for folks coming from a gamedev background, but non-obvious to everyone else. Once you get the hang of these idioms, the productivity difference between "object soup" Python and Rust gets smaller, and the resulting code is also closer to what a "production" app would need to look like. This is an extra learning curve for Rust, though, on top of the already famously steep learning curve for the basics.
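
A minimal sketch of that idiom (hypothetical names): the container owns all the nodes, and "references" between them are plain usize indexes:

    struct Node {
        value: i32,
        children: Vec<usize>, // indexes into Graph::nodes instead of references
    }

    struct Graph {
        nodes: Vec<Node>,
    }

    impl Graph {
        fn add_node(&mut self, value: i32) -> usize {
            self.nodes.push(Node { value, children: Vec::new() });
            self.nodes.len() - 1 // the new node's index acts as its "pointer"
        }

        fn add_edge(&mut self, from: usize, to: usize) {
            self.nodes[from].children.push(to);
        }
    }

    fn main() {
        let mut g = Graph { nodes: Vec::new() };
        let a = g.add_node(1);
        let b = g.add_node(2);
        g.add_edge(a, b); // no lifetimes or Rc needed; cycles are fine too
    }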


I really don’t like this approach - it's just pointers again: you avoid the outright memory-safety violations, but you get all the other problems, e.g. (logical) use-after-free via stale indexes, without any of the tooling like valgrind to catch it for you.


I use it professionally and can't disagree more with this statement.


I also use it professionally, for both web apps and HPC algorithms.

Are there some things faster to write in python? Sure. But I find the mental overhead is significantly less (for me) in Rust, and overall dev time is about equal, since I typically hit far fewer bugs and spend less time reading docs in Rust. I can't remember the last time I hit a footgun in Rust. Seems I hit one every week in python.

We recently migrated a ~10k SLOC Django JSON API server to axum/sqlx (Rust). I couldn't be happier - faster to ship new features, faster to refactor, fewer bugs, and response times got about 10x quicker.


I work a lot on async code with data structures that need interior mutability and it's kind of a pathological case for borrow checking. Everything is effectively wrapped in Arc<RwLock<_>> which adds a bunch of noise to method implementations.
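
A small sketch of what that noise looks like (hypothetical names, using std's RwLock; the async variants read much the same, just with .await sprinkled in):

    use std::collections::HashMap;
    use std::sync::{Arc, RwLock};

    // Shared, mutable-from-many-places state ends up wrapped like this.
    type Shared<T> = Arc<RwLock<T>>;

    struct Registry {
        sessions: Shared<HashMap<u64, String>>,
    }

    impl Registry {
        fn insert(&self, id: u64, name: String) {
            // Every access goes through a lock guard (plus an unwrap for lock poisoning).
            self.sessions.write().unwrap().insert(id, name);
        }

        fn get(&self, id: u64) -> Option<String> {
            self.sessions.read().unwrap().get(&id).cloned()
        }
    }

    fn main() {
        let reg = Registry { sessions: Arc::new(RwLock::new(HashMap::new())) };
        reg.insert(1, "alice".to_string());
        assert_eq!(reg.get(1), Some("alice".to_string()));
    }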


It's like skiing (or maybe riding a bike for the first time): a steep learning curve, then it becomes somewhat instinctual and fairly routine and trivial and fun. Until you get into the tricky terrain, and then it will put up resistance. But usually for your own good.


Rolling with the analogy, I think learning Rust is a lot more like snowboarding, and C/C++ is more like learning to ski.

I fell a lot more learning on a snowboard than on skis.

Skis had a much quicker early learning curve and made me feel over-confident. Several times I found myself on trails too steep for my skills, and the skis made me work hard to recover. Most of the techniques I learned as a beginner didn't work beyond green trails, and blues and blacks all required new, harder techniques.

With snowboarding, because I fell a lot more early on, my confidence grew slowly. Meanwhile, the techniques I was learning on the greens (and the tool in general), which were HARD to learn there, were actually EASIER on the blues and continued to work on the black diamonds. Granted, I also had to learn new techniques on the harder trails, but the beginner techniques and the slower development made me MUCH more comfortable across the whole mountain much faster than skis did.

Double-blacks are more like unsafe rust. :-D


Yeah, fair, I often make the same point about snowboarding vs skiing. Skiing is "easy" to pick up but difficult to master. Most people on the hill are just backseating it down in bad form, like they learned in their first week. Getting good form is ... a lifelong effort.

Snowboarding is brutally hard to pick up at first, unless you hate your tailbone. I have never really tried, I'm too old for that kind of pain.

But I'm a telemark skier, so the worst of both worlds :-) Maybe telemark is like advanced C++. But probably more like Haskell.


Isn't that the same for most languages once you move beyond vanilla Java/C#/Go etc.?


Rust has "opposite" call semantics from what most people are used to, trained since they were new programmers. Sort of. It takes a while to think like the borrow checker and get used to the way arguments get passed around. It's like using C++ and doing std::move for every non-reference argument.


Sounds like the transition from OOP to functional. It feels opposite before you become comfortable.


I would


All the library does is facilitate the build process.

I think there's possibly some automation that can be added once all of that stabilizes properly. From Rust it's already possible to automate exposing a C header via cbindgen, so it wouldn't be too hard.

JNI ultimately still provides the most complete set of capabilities (e.g. calling back into Java).

For the rust-maven-plugin, we've accepted a PR to build binaries and we'll continue accepting PRs for any future enhancements such as the one you're mentioning.


Hello,

Here at QuestDB we're beginning to write part of our database code in Rust as JNI native code extensions. Since all of our Java code is built with Maven we found we needed an easier way to invoke `cargo` during our normal build cycle. The plugin makes it easier and has a few extra features such as running tests.

If you also have a Java code base you're thinking of extending with Rust, we hope you can find this project useful.


Hi, I'm the original author of the QuestDB Python client library and benchmark.

It all started when one of our users needed to insert quite a bit of data into our database quickly from Pandas. They had a dataframe that took 25 minutes to serialize by iterating row-by-row; the culprit was .iterrows(). Now it takes a handful of seconds.

This took a few iterations: at first I thought this could all be handled by the Python buffer protocol, but that turned out to create a whole bunch of copies, so for a number of dtypes the code now uses Arrow when it's zero-copy.

The main code is in Cython (and the fact that one can inspect the generated C is pretty neat) with supporting code in Rust. The main serialization logic is in Rust and it's in a separate repo: https://github.com/questdb/c-questdb-client/tree/main/questd....


Hi, I'm Adam Cimarosti, one of the core engineers at QuestDB.

We built play.questdb.io to make it easy for anyone to try our database. No installation.

There's a Jupyter Lab notebook, data, sample code, queries and graphs.

We'd love to hear what you think.


This is really cool -- congrats on the launch! Similarly, the team over at AuthZed has created a playground for SpiceDB[0], by using WebAssembly and Monaco.

We debated for hours whether or not to go the notebook route. I'm sure y'all did something similar; would you care to share your reasons for going with the notebook?

[0]: https://play.authzed.com


We actually do have a web-based demo at https://demo.questdb.io/ which comes preloaded with millions of rows of data.

That one focuses on SQL queries though.

The notebook in https://play.questdb.io/ offers a more rounded experience to try out any and all of our features.

You can use the notebook to try out data ingestion, dropping partitions and more - things that are simply not possible in a more sandboxed environment.

The other part is that we value our Python users and wanted to provide an example of how to use our database in conjunction with other tools commonly used in the data science space to slice and dice time series data.


The demo does not work at all: https://github.com/questdb/questdb/issues/1525


I think this is a little bit harsh; you are pointing to one specific query that does not work, out of a dataset of 1.6 billion rows exposed live to the internet that can sustain tens of thousands of concurrent users. In any case, we're grateful to the CTO of ClickHouse for pointing out these things so we can improve the product further. But generalisations such as "the demo does not work at all" are not a fair comment, nor beneficial IMO.


Sorry, but it has not worked for 1.5 years already. Every time I check, it does not work.


Reading your pitch here, I'd love to have a vague idea of what QuestDB is and why I should care.


Most databases store the latest state of something. We don't. We ingest events. After all, life is a function of time :-) The whole world ticks, and we take those ticks and store them. If part of your application tracks anything happening over time (trades, ocean pollution levels, ships moving, rocket simulation metrics... or whatever else), then it makes sense to store those events in a time series database.

What we provide, primarily, is two basic pieces of functionality: (1) Taking in lots of events FAST. Our ingestion rate is high (and we also integrate with things like Kafka, Pandas -- see the notebook, etc). Each of our time series tables (we support regular ones too) comes with a special timestamp column. (2) Specialized SQL to make sense of data that changes over time, such as grouping and resampling by time and more. Take a look at our docs for things like SAMPLE BY, LATEST ON, ASOF JOIN, LT JOIN and more.

On disk, we also guarantee that all records are sorted by time, and this gives us great query performance for these time-based types of queries.

PS. We're also wire-compatible with PostgreSQL.


I was once in the market for time series databases, but all I could find required down sampling of older data. I don't know if this has changed, and to be fair I haven't been looking for quite some time, but does yours allow for keeping data with the captured precision in perpetuity (or until my hard drive fills up)? My guess from the way you describe your approach is yes, but I wanted to check.


Yes. We're pretty good with large volumes of data.

Eventually all local drives fill up though.

When ingesting data we partition it by time. By default we partition by day. This gives you the flexibility to detach partitions, store them somewhere slower and cheaper with more capacity for longer term storage, and reattach them later if need be.

Built on top of our open source primary product, we also have a cloud variant of QuestDB which runs on AWS. One of the things that we're building there is cold storage. It will automate this process onto S3 such that if a query ever needs to access this older data, it will reinstate it automatically for you with no admin overhead.


Thanks for the detailed reply, I was curious as well. How does this compare with InfluxDB? I was actually looking into a way to store my own financial data of US equities for backtesting and experimentation awhile back. I never did get any further than the planning phase but this seems like it would almost be ideal for that use case.


[One edit, adding one additional paragraph at the end]

Note that I'm one of the co-founders of QuestDB, but let me try to be as objective and unbiased as possible. Under the hood, InfluxDB and QuestDB are built differently. Both storage engines are column-oriented. InfluxDB's storage engine uses a Time-Structured Merge Tree (TSM), while QuestDB uses a linear data structure (arrays). A linear data structure makes it easier to leverage modern hardware with native support for the CPU's SIMD instructions [1]. Running close to the hardware is one of the key differentiators of QuestDB from an architectural standpoint.

Both have a Write-Ahead Log (WAL) that makes the data durable in case of an unexpected failure. Both use the InfluxDB Line Protocol to ingest data efficiently. Hats off to InfluxDB's team, we found the ILP implementation very neat. However, QuestDB's implementation of ILP is over TCP rather than HTTP for performance reasons. QuestDB is Postgres Wire compatible, meaning that you could also ingest via Postgres, although for market data it would not be the recommended way.

One characteristic of QuestDB is that data is always ordered by time on disk, and out-of-order data is dealt with before touching the disk [2]. The data is partitioned by time. For queries spanning time intervals, the relevant time partitions & columns are lifted to memory, while others are left untouched. This makes such queries (downsampling, interval search etc) particularly fast and efficient.

From a developer experience standpoint, one material difference is the language: InfluxDB has got its own native language, Flux [3], while QuestDB uses SQL, with a bunch of native SQL extensions to manipulate time-series data efficiently: SAMPLE BY, LATEST ON, etc [4]. QuestDB also includes SQL Joins and time-series join (ASOF Join) popular for market data. Since QuestDB speaks the postgresql protocol, developers can use their standard Postgres libraries to query from any language.

From a performance perspective, InfluxDB is known to struggle with ingestion and queries alongside high-cardinality datasets [5]. QuestDB deals with such high cardinality datasets better and is particularly good at ingesting data from concurrent sources, with a max throughput that can now reach nearly 5M rows/sec on a single machine. Benchmarks on TSBS [6] with the latest version will follow soon.

InfluxDB is a platform, meaning that they provide an exhaustive offering around the database, while QuestDB is less mature. QuestDB is not yet fully compatible with several tools (say a dashboard like metabase for example), as some popular ones have been prioritised instead (Grafana, Kafka, Telegraf, Pandas dataframes). The charting capabilities of InfluxDB's console are excellent, while QuestDB users would mostly rely on Grafana instead.

[Adding this via post edit #1] One area where Influx currently has an edge is storage overhead. QuestDB does not support compression yet. Time-series data can often be compressed well [7]. Chances are QuestDB will use more disk space to store the same amount of data.

Hope this helps!

[1] https://news.ycombinator.com/item?id=22803504

[2] https://questdb.io/blog/2021/05/10/questdb-release-6-0-tsbs-...

[3] https://docs.influxdata.com/influxdb/cloud/query-data/get-st...

[4] https://questdb.io/blog/2022/11/23/sql-extensions-time-serie...

[5] https://docs.influxdata.com/influxdb/cloud/write-data/best-p...

[6] https://github.com/timescale/tsbs

[7] https://www.vldb.org/pvldb/vol8/p1816-teller.pdf


So I guess it would be fair to say you compete with Timescale and Clickhouse as a timeseries database?


Yes, correct - although ClickHouse is more of an OLAP database. Timescale is built on top of Postgres, while QuestDB is built from scratch with Postgres wire compatibility. You can run benchmarks using https://github.com/timescale/tsbs


I take it you did not visit the link?

