I suspect Go is probably better, but as a long-time C# developer I cringe at the idea of implementing a DB in a GC language. It seems that you would be fighting the GC all the time and would have to write a lot of non-obvious low-allocation code, using unmanaged structures, unsafe, and so on. All doable of course, but it seems like starting on the wrong foot. Maybe fine for a very small team, but onboarding new devs with the right skill set would be hard.
There are quite a few database products and other data-intensive systems written in Go, Java, and many other languages. Generally this is much less of an issue than you think. And it's offset by several benefits: nice primitives for things like concurrency, and a pleasant language to work with.
On the JVM you have things like Cassandra, Elasticsearch, Kafka, etc., each of which offers performance and scale. There are lots more examples. As far as I know they don't do any of the things you mention, or at least not a lot. And you can use memory-mapped files on the JVM, which helps as well. Elasticsearch uses this a lot, and I imagine Kafka and Cassandra do similar things.
As for skillset, you definitely need to know what you are doing if you are going to write a database. But that would be true regardless of the language.
While it is true that Cassandra and Kafka are great software that countless developers rely on to handle massive scale...
It is also true that the JVM and the GC are a bottleneck in what they are able to offer. Scylla and Redpanda's pitch is "we are like this essential piece of software, but without the JVM and GC".
Of course, having a database written in Go still has its pros and cons, so to each their own.
The JVM and GC have a learning curve for the people implementing the database. But most users wouldn't get exposed to any of that. And if you manage it properly, things work fine. I've used Elasticsearch (and Opensearch) for many years. This is not really an issue these days. It was 10 years ago, when JVM garbage collection was a lot less advanced than it is now. These days, that stuff just works. I haven't had to tune GC on the JVM for at least half a decade. It's become a complete and utter non-issue.
There are many valid reasons to pick other products but Elasticsearch is pretty good and fast at what it does. I've seen it ingest content at half a million documents per second. No stutters. Nothing. Just throw data at it and watch it keep up with that for an hour sustained (i.e. a bit over a billion documents). CPUs maxed out. This was about as fast as it went. We threw more data at it and it slowed down but it didn't die.
That data of course came from Kafka, passed through a bunch of Docker processes (Scala). All JVM based. Zero GC tuning needed. There was lots of other tuning we did, but the JVM wasn't a concern.
I think this depends on the level of optimization you go for. At the extreme end, you’re not gonna use “vanilla” anything, even in C or Rust. So I doubt that you’ll get that smooth onboarding experience.
In Go, I've found that with a little bit of awareness and a small bag of tricks, you can get very low allocations on hot paths (where they matter). This comes down to using sync.Pool and being clever with slices to avoid copying. That's a footgun-for-performance tradeoff that's well worth it, and it can get you really far quickly.
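For what it's worth, .NET has the same pattern: ArrayPool<T> is the rough counterpart of Go's sync.Pool. A minimal sketch of the rent/slice/return idea, with Fill and Consume as hypothetical stand-ins:

```csharp
using System;
using System.Buffers;

// Rent a reusable buffer instead of allocating on the hot path; Return
// hands it back for the next caller. Rent may give a larger array than
// requested, so track the written length yourself.
byte[] buf = ArrayPool<byte>.Shared.Rent(4096);
try
{
    int written = Fill(buf);
    Consume(buf.AsSpan(0, written)); // slice instead of copying
}
finally
{
    ArrayPool<byte>.Shared.Return(buf);
}

// Hypothetical stand-ins so the sketch runs.
static int Fill(byte[] b) { b[0] = 42; return 1; }
static void Consume(ReadOnlySpan<byte> data) => Console.WriteLine(data[0]);
```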
Well, with a manually managed language you have to do those things pretty much all the time, but with a GC you can pick which parts are manually managed.
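In .NET terms, a sketch of what picking a manually managed part can look like (NativeMemory is the .NET 6+ API; the GC never sees this region), assuming unsafe code is enabled:

```csharp
using System;
using System.Runtime.InteropServices;

unsafe
{
    // A manually managed region inside a GC'd program: allocate native
    // memory, wrap it in a span for safe access, free it explicitly.
    byte* mem = (byte*)NativeMemory.Alloc((nuint)1024);
    try
    {
        var span = new Span<byte>(mem, 1024);
        span.Fill(0xFF);
        Console.WriteLine(span[0]); // 255
    }
    finally
    {
        NativeMemory.Free(mem);
    }
}
```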
Also I suspect this project isn't for holding hundreds of GB of stuff in memory all the time, but I could be wrong.
You would be surprised by performance of modern .NET :)
Writing no-alloc code is oftentimes done by reducing complexity and not doing "stupid" tricks that work against JIT and CoreLib features.
For databases specifically, .NET might actually be positioned very well with its low-level features (intrinsics incl. SIMD, FFI, struct generics, though not entirely low-level) and high-throughput GC.
An interesting example of this applied in practice is Garnet[0]/FASTER[1]. Keep in mind that its codebase still has many instances of un-idiomatic C#, and you can do way better by further simplification, but it already does the job well enough.
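To make the SIMD point concrete, here's a minimal sketch using the portable System.Numerics.Vector<T> API (not code from Garnet/FASTER, just what an assumed hot loop might look like):

```csharp
using System;
using System.Numerics;

int[] data = new int[1000];
for (int i = 0; i < data.Length; i++) data[i] = i;
Console.WriteLine(SumVectorized(data)); // 499500

static int SumVectorized(ReadOnlySpan<int> values)
{
    var acc = Vector<int>.Zero;
    int i = 0;
    // Process Vector<int>.Count lanes per iteration (width is hardware dependent).
    for (; i + Vector<int>.Count <= values.Length; i += Vector<int>.Count)
        acc += new Vector<int>(values.Slice(i));
    int sum = Vector.Sum(acc);
    for (; i < values.Length; i++) sum += values[i]; // scalar remainder
    return sum;
}
```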
Using .NET 6. I agree, performance is generally great / just as fast as its peers (i.e. Java and Go). However, if you need to think about memory a lot, GCed runtimes are an odd choice.
Primarily because you not only need to think about memory, but also about the general care and feeding of the GC (the behavior of which can be rather opaque). To each their own, but based on my own (fairly extensive) experience, I would not create a new DB project in a GCed runtime given the choice. That being said, I do think C# is a very nice language with a high-performance runtime and a wonderful ecosystem, and it's a great choice for many / most projects.
Isn't it characteristic of languages like C++ or Rust that you have to think way more about managing memory (and even more so if you use C)?
Borrow-checker-based drop/deallocation is very, very convenient for data with linear or otherwise trivial lifetimes, but for complex cases you still end up paying with your effort, with Arc/Arc<Mutex<T>>, or with both. Where it does help is knowing that your code is thread-safe, something Rust is unmatched at.
But otherwise, C# is a quite unique GC-based language in that it offers a full set of tools for low-level control when you do need it. You don't have to fight the GC: once you use e.g. struct generics for abstractions, and stack-allocated/pooled buffers or native memory for data, neatly wrapped into spans, you get something close to C but in a much nicer-looking and more convenient package. (The GC is also an optimization, since it acts like an arena allocator and each individual object costs less than in a pure reference-counted approach.)
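A small sketch of that combination (IByteFilter, NonZero, and Scan are made-up names for illustration): a struct-constrained generic abstraction over a stack-allocated buffer wrapped in a span.

```csharp
using System;

// Usage: a stack-allocated buffer wrapped in a span; nothing here touches
// the GC heap, yet the filtering logic stays a reusable abstraction.
Span<byte> buf = stackalloc byte[64];
buf[3] = 7;
Console.WriteLine(Scan.Count(buf, new NonZero())); // 1

// "Struct generics": constraining TFilter to a struct makes the JIT emit a
// specialized Count per filter type and inline Keep, so there is no boxing
// and no virtual dispatch.
interface IByteFilter { bool Keep(byte b); }
struct NonZero : IByteFilter { public bool Keep(byte b) => b != 0; }

static class Scan
{
    public static int Count<TFilter>(ReadOnlySpan<byte> data, TFilter filter)
        where TFilter : struct, IByteFilter
    {
        int n = 0;
        foreach (byte b in data)
            if (filter.Keep(b)) n++;
        return n;
    }
}
```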
Another language, albeit with small adoption, is D, which also has a GC (with escape hatches when you want them) while offering features to compete with C++ (can't attest to its GC performance, however).
That is the ideal setup though: profit from the productivity of automatic resource management, and only write low-allocation code on the paths that actually matter, as the ASP.NET team has been doing since .NET Core was introduced, with great performance results.
yup, it's bad, and even if you "do everything right" minimization-wise, if you're still using the heap then eventually fragmentation will come for you too
Languages using moving garbage collectors, like C# and Java, are particularly good at avoiding fragmentation entirely, or at worst suffering it only marginally.
In order to move objects you need to stop the world and update everything that will ever point to them: pointers, pointer aliases, arithmetic bases, arithmetic offsets. This quickly becomes intractable. It's not strictly just the pointers themselves, but the fact that pointers can be used arithmetically in various ways, even though the most obvious ways are disallowed. The obvious example is unsafe.Pointer and uintptr. That path has some guards, for example go vet will flag converting a uintptr variable back to an unsafe.Pointer, but mix in some slices or reflect usage and you can quickly get into shaky territory. I believe you can achieve badness even without using the unsafe package.
Interesting. I did not realize it was such a problem in the JVM.
EDIT: in Go, not the JVM; somehow it is always Go having trouble in the systems programming domain.
In .NET this does not seem to be a performance issue; in fact, it is used by almost all performance-sensitive code to match or outperform C++ when e.g. writing SIMD code.
There are roughly three ways to pass data by reference:
- Object references (which work the same way as in the JVM, compressed pointers notwithstanding)
- Unsafe pointers
- Byref pointers
While object references don't need explanation, the way the last two work is important.
Unsafe pointers (T*) are plain C pointers and are ignored by the GC. They can originate from unmanaged memory (FFI, NativeMemory.Alloc, stack, etc.) or from objects (the fixed statement and/or other unsafe means). When a pointer into an object's interior is required (e.g. a byte* from a byte[], or a MyStruct* from a MyStruct field in an object), the object is pinned by setting a bit in the object header, coincidentally the same bit the GC uses during the mark phase (concurrent or otherwise) to indicate live objects. When an object is pinned this way, the GC will not move it during the relocation and/or compaction phase (again, concurrent or otherwise; not every GC is stop-the-world). In fact, objects are moved whenever the GC deems it profitable, either by moving them to a heap belonging to a different generation or by compacting the heap when a combination of heuristics prompts it to reduce fragmentation. Over the years, numerous GC improvements have reduced the cost of object pinning to the point that it is rarely a concern today, partly because of those improvements and partly because of the next feature.
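For illustration, the fixed statement described above in its simplest form (a sketch; needs a project with unsafe code enabled):

```csharp
byte[] data = new byte[256];

unsafe
{
    // fixed pins 'data': the GC will not relocate the array while the raw
    // pointer is live. The pin ends when the block exits.
    fixed (byte* p = data)
    {
        for (int i = 0; i < data.Length; i++)
            p[i] = (byte)i;
    }
}
```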
Byref pointers (ref T) are like regular unsafe pointers, except they are specially tracked by the GC, which lets them point into object interiors without unsafe code or pinning. You can still write unsafe code with them by doing "byref arithmetic", which CoreLib and other performance-sensitive code does (I'm writing a UTF-8 string library that relies heavily on this for performance), and they can also point to arbitrary non-object memory just like regular unsafe pointers: stack, FFI, NativeMemory.Alloc, etc. They are what Span<T> is built on (internally it is a ref T plus an int length), allowing arbitrary memory ranges to be wrapped in a slice-like type that all standard library APIs can then safely consume (you can int.Parse a Span<byte> sliced from a stack-allocated buffer, an FFI call, or a byte[] array without any overhead). Byref pointers can also be pinned, by being stored in a stack-frame location reserved for pinned addresses, which the GC is aware of; for stack and unmanaged memory this is a no-op, and for object memory it only matters during the GC itself. Of course nothing is ever free in software engineering, but the overhead of byrefs is considered insignificant compared to object pinning.
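Concretely, the Span<T> and byref mechanics above look roughly like this (a sketch; Unsafe.Add is the "byref arithmetic" mentioned):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// A span over stack memory: no heap allocation, no pinning needed.
Span<char> buf = stackalloc char[8];
"1234".AsSpan().CopyTo(buf);

// Standard library APIs consume it directly.
int value = int.Parse(buf[..4]); // 1234

// "Byref arithmetic": walk memory via ref T instead of indexing.
ref char first = ref MemoryMarshal.GetReference(buf);
char third = Unsafe.Add(ref first, 2); // '3'
Console.WriteLine($"{value} {third}");
```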
Java behavior is in general similar to .NET: there are types & APIs to pin memory for FFI and aliasing use cases, but normal references are abstracted, enabling the GC to perform compaction/moves for non-pinned objects. The behavior I described in the parent post was Go, which exposes pointers directly rather than references.
It is still possible to end up with fragmentation challenges with both the JVM and .NET under specific circumstances, but there are also a lot of tunable parameters and so on to help address the problem without substantially modifying the program design to fit an under-specified contract with the memory subsystem. In Go there are few tunables, and the GC's behavior and its interaction with application code at this level of concern is even less specified.