I developed a software system with hundreds of millions to a few billion data entries in the core object store (C++) and needed very fast access times. The requirement was to open/load/read those objects in the database very quickly (< 0.01 seconds). When I profiled those operations in early versions of the code, the biggest bottleneck was always malloc/free once the system passed 10 million records.
To get around that limitation, I ended up designing the data structures / datastore to eliminate malloc/free and new/delete entirely: I wrote my own memory manager that maps all the info directly onto structures in the file via mmap and does its own sub-allocations. With that, I get constant access time of < 0.01 seconds regardless of the size of the database (measured from program startup to returning the value to the client).
I think every modern database or big-data type system depends on its own custom memory manager. My theory: a language that depends on garbage collection cannot be used for those big-data application designs.
I would love to be proven wrong by someone re-implementing MySQL (or even just SQLite) in pure Go (not a Go driver wrapping the C SQLite) that provides a similar level of performance under Go's garbage collector compared to the C counterpart.
The use case I care about is a DB size of 0.1 GB - 50 GB with hundreds of millions of records in the DB.
Go has mmap and casting via the unsafe package, so the Go solution to your performance problem looks exactly the same as the C++ solution.
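For what it's worth, here is a minimal sketch of that claim, assuming a Unix system and a file of packed fixed-size records. The Record layout and the store.dat file name are invented for the example, and error and bounds handling is skimped:

    package main

    import (
        "fmt"
        "os"
        "syscall"
        "unsafe"
    )

    // Record is a hypothetical fixed-layout entry. Cross-references are stored
    // as byte offsets within the file, not Go pointers, so the GC never has to
    // scan the mapped data.
    type Record struct {
        Key   uint64
        Value uint64
        Next  uint64 // byte offset of the next record inside the mapping
    }

    func main() {
        f, err := os.Open("store.dat") // assumed to hold packed Record structs
        if err != nil {
            panic(err)
        }
        defer f.Close()

        fi, err := f.Stat()
        if err != nil {
            panic(err)
        }

        // Map the whole file read-only (syscall.Mmap is Unix-only).
        data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(data)

        // Cast the mapped bytes to a record, just as the C++ version would:
        // no copy, no per-record allocation.
        r := (*Record)(unsafe.Pointer(&data[0]))
        fmt.Println(r.Key, r.Value)

        // Follow an offset "pointer" with plain arithmetic.
        next := (*Record)(unsafe.Pointer(&data[r.Next]))
        fmt.Println(next.Key)
    }

The mapped bytes live outside the Go heap, so anything reachable from them has to be expressed as offsets rather than Go pointers, which is the same sub-allocation discipline described above.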
I'd say Go is actually an ideal language for building a database because:
1) Syscalls and low-level I/O (including mmap) are all pretty much first class.
2) networking (duh)
Some final ramblings:
I realize "systems programming language" has lots of different definitions, some of which go doesn't meet due to it's stop the world gc. I'll concede that point, but as a parting shot let me remind you all that free'ing a tree of objects can cause similar pauses, although at a more convenient time. For me, I would assume that there are no guarantees in a database type system, you just want to go for something like: "99% of requests complete in less than a millisecond under certain controlled conditions". You never actually know if:
1) You have disk contention on the data you're reading, causing pauses.
2) You've been swapped out and a pure in-memory operation slows down by an order of magnitude.
3) A billion other things.
For me personally, I don't really mind adding
4) The Go GC ran
To that list of things, because the benefits are worth the tradeoff, but that decision depends on the project.
I knew that in theory one can use mmap and unsafe pointers in Go.
One main difference between C and Go is this:
C can type-cast any mmap'd pointer and dereference it to get the values (pointers, offsets to other locations) extremely fast, in under 0.01 microseconds (assuming the page is swapped in). BTW, I track the execution time of individual C lines at nanosecond resolution all the time. It is not difficult to do with the RDTSC instruction, which gives you timing resolution in terms of CPU clock cycles (one clock cycle on a 2 GHz CPU is 500 picoseconds). At that resolution one can easily see the effects of paging and context switching.
In C, that access is only a few assembly instructions. In Go (or another GC-based language) that one operation translates into multiple function calls, which is probably tens if not a hundred times more expensive in pure CPU cycles.
When one has to do this hundreds of millions of times, it usually makes a huge difference in overall execution time. I personally know of plenty of test cases where that is the difference between finishing an operation in minutes versus days (plus GBs of extra RAM).
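Go doesn't expose RDTSC without dropping into assembly, but the standard testing harness reports per-operation timings in nanoseconds, which is enough to compare a direct cast against a copying decode. A hypothetical micro-benchmark (field layout and names invented for the example):

    package store

    import (
        "encoding/binary"
        "testing"
        "unsafe"
    )

    // Record mirrors the hypothetical on-file layout from the earlier sketch.
    type Record struct {
        Key, Value, Next uint64
    }

    // page stands in for one mmap'd page; a real benchmark would map a file.
    var page = make([]byte, 4096)

    var sink uint64 // keeps the compiler from optimising the loops away

    // Direct dereference of the mapped bytes: the "few instructions" path.
    func BenchmarkUnsafeDeref(b *testing.B) {
        var sum uint64
        for i := 0; i < b.N; i++ {
            sum += (*Record)(unsafe.Pointer(&page[0])).Value
        }
        sink = sum
    }

    // Copying the record out field by field: extra calls and copies per
    // access, the kind of overhead being described above.
    func BenchmarkCopyDecode(b *testing.B) {
        var sum uint64
        for i := 0; i < b.N; i++ {
            r := Record{
                Key:   binary.LittleEndian.Uint64(page[0:8]),
                Value: binary.LittleEndian.Uint64(page[8:16]),
                Next:  binary.LittleEndian.Uint64(page[16:24]),
            }
            sum += r.Value
        }
        sink = sum
    }

Running "go test -bench ." prints ns/op for each; it won't give cycle counts the way RDTSC does, but it is plenty to see whether a given access path stays in the single-digit-nanosecond range.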