Overall I think using SQLite locally to offload database work is incredibly powe...

NortySpock · on July 17, 2024

If you think those read speeds are great, try DuckDB (which has many SIMD improvements) if you want to blow your socks off.

bearjaws · on July 17, 2024

This solution scales to tens of thousands of images in my testing (probably ~200k or more before searching gets too slow). Why use a nuclear bomb when a stick of dynamite is good enough :)

codetrotter · on July 17, 2024

Do these SIMD improvements still apply if you’re compiling DuckDB to WASM and running it in a browser?

nilslice · on July 17, 2024

Yes, if the browser enables SIMD! https://caniuse.com/wasm-simd

drozycki · on July 17, 2024

Isn’t one OLTP and the other OLAP? I don’t understand why DuckDB is often suggested as a drop-in replacement.

NortySpock · on July 17, 2024

Yes, DuckDB is OLAP, SQLite is OLTP. I should have called that out.

But if you are doing aggregate or skip-scan analysis and if they're talking about read speeds and in-memory processing, well, SQLite leaves some performance on the table by being single-threaded, as far as I can tell.

bastawhiz · on July 17, 2024

> Meanwhile in AWS you would pay $27k a month to have the same IOPS as a Lenovo Thinkpad X1.

This is kind of an unfair comparison. Essentially nobody needs a million iops for their database. Even an extremely busy database doesn't need to scan all of the data it holds (or at least, if it does, you're using it very wrong—that's why we have indexes).

A fast disk is possible on a laptop because it's a tiny hop to RAM. And it's desirable because you probably have nowhere near 2TB of RAM handy, so you need it to be fast.

In the cloud you can get 2TB of ram for $11k/mo (4x r6g.16xlarge). Not that you need anything like that to run your database. Most of that data is never being queried.

It's also the case that a laptop workload is very different than a server workload. If I run a steam game, I want it open fast. My laptop isn't crunching numbers on all the bytes at that moment. When I run a table scan on a Postgres table, processing needs to happen on every single tuple. A million iops isn't useful if your CPU immediately becomes the bottleneck. A Thinkpad would simply never match the response times of a server with a tenth of the iops under load (if the workload required scanning huge amounts of data).

So yes, the iops are more expensive, but that's really not a metric that anyone in the target market is hurting over.

bearjaws · on July 17, 2024

"Nobody needs IOPS until they need IOPS"

I've had to scrub a multi-terabyte database of PII before moving it to a staging environment, and it hurts. With modern data architectures, you may write the same data 4-5 times in its life cycle: staging data, data warehouses, marketing, PowerBI, Looker, etc...

Especially reporting solutions, where they may aggregate massive amounts of data and write it to temp tables.

It will require IOPS, and you will pay handsomely for it.

bastawhiz · on July 17, 2024

I don't know about your specific example, but doing an operation where you rewrite most of a multi terabyte database online is almost certainly not best accomplished with SELECT/UPDATE. Even if you need multiple passes, that's N terabytes times M passes times two. That's... not a lot of reads and writes. Dumping the database to files on blob storage, rewriting them, then reading them into your destination is almost certainly the fastest and cheapest way to go about that.

And that's not an iops avoidance thing, that's a "this isn't what your database is built to do with the configuration you're running it in" sort of thing.
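As a rough sketch of the "N terabytes times M passes times two" arithmetic above: the 5 TB size, the 3 passes, and the 1 GB/s sustained throughput below are all made-up illustrative numbers, not figures from this thread, and the helper names are hypothetical.

```python
def total_io_tb(db_size_tb: float, passes: int) -> float:
    """Total data moved if every pass reads and rewrites the whole database."""
    # "times two": each pass is one full read plus one full write
    return db_size_tb * passes * 2


def hours_at(throughput_gb_per_s: float, total_tb: float) -> float:
    """Hours needed to move total_tb at a sustained throughput (decimal units)."""
    return total_tb * 1000 / throughput_gb_per_s / 3600


# Hypothetical workload: 5 TB database, 3 scrub passes, 1 GB/s sustained.
total = total_io_tb(5, 3)
hours = hours_at(1.0, total)
print(f"{total:.0f} TB moved, about {hours:.1f} hours at 1 GB/s")
```

Under those assumptions the job is bounded by sequential throughput rather than by random IOPS, which is the case for doing it as a bulk dump and reload.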

whalesalad · on July 18, 2024

Cloud providers have been delivering such shit IOPS and charging such unhinged prices for basic modern-day performance that I really can’t agree with you at all. Nobody “needs” high IOPS the same way nobody needs a car that can drive faster than 70mph. Why would anyone settle for an artificial ceiling?

bastawhiz · on July 19, 2024

It's not an artificial ceiling, because the storage and the compute are decoupled. The disk isn't physically in the server.

Zero times in the half decade that I've been administering terabyte-scale RDS instances has my server had to even restart due to a disk failure. In fact, I never even need to think about disk failures because they're automatically redundant. And if one occurred, nobody needs to crack open the server running my databases to fix anything, because the disk is replicated somewhere else. And if I need more space, I don't need to change the server at all: the volume just scales up (and delivers more iops) because another physical disk somewhere else gets RAIDed in. The cost of addressing the disk over the network is where the iops go.

That's hugely valuable. You simply don't get that if you're building the server yourself (which is really the only time you can achieve millions of iops). But even if you did build and run such a server, I can almost guarantee that you couldn't get a real production database workload to get above a hundred thousand iops unless you delete all of your indexes on terabytes of data. I would bet real money that you can't physically build a server that can maintain 1M iops serving real load on real data without artificial manipulation (both because you don't have a service with enough load to demand iops even close to that and because databases don't pull gigabytes from the disk every second).

Let's say you did build such a server. Now you're CPU bound. You simply can't throw enough CPUs at the database to process all the rows you're scanning at a million iops. Literally: the ability to distribute the rows for processing decreases as the number of cores goes up and you'll hit a ceiling. Bursting up to a million iops for the second needed to fill your RAM is useless if you then spend the next minute processing all that data on 128 cores. And even if you're just serving blobs that don't need processing, you're now network constrained. The only reason a laptop has that many iops is the user physically interacts with the machine, and the goal is to just fill RAM as fast as possible.

It's theoretically possible to build your own extremely optimized purpose-built database that can max out the resources of such a server. But the cost of writing such a server plus the expense of building and racking and maintaining the server will almost certainly scale the cost to an order of magnitude what you'd pay to just horizontally scale that workload to commodity software and cloud infrastructure.

To your analogy, nobody needs a car that can go 400mph. But why would you pay $5M for one? What roads can you drive it on? Why would you want to have to stop for gas or charge every ten miles? Are you okay with seating only the driver with no trunk space? Who is going to insure it?

stackskipton · on July 17, 2024

I'd also point out, about the AWS cost (which is pretty high): getting IOPS is easy. Getting redundant IOPS is the hard part.

Spivak · on July 17, 2024

Huh? Not really. AWS is a marvel of engineering in that it can be everyone's redundant IOPS but being your IOPS isn't such a big deal. Most folks are single region and rent-a-datacenter colo operations can get you two racks with separate power/uplink no issue.

I hope everyone at least once in their career gets to experience just how god damn fast hardware (especially networking speeds between your own servers) is. Sweet lord loading up three bare-metal dbs with 1TB ram each and bonded 10G nics where the app servers didn't even have to hit a router to talk to them. We initially thought the replication lag being pinned at 0 was a mistake.

Don't take this as any sort of condemnation of cloud offerings, being able to spin up comparable infra in my underwear and not having to think about purchasing and dealing with hardware vendors are truly a gift.

sgarland · on July 18, 2024

> I hope everyone at least once in their career gets to experience just how god damn fast hardware (especially networking speeds between your own servers) is.

THIS. I realized cloud disks were much slower than I thought when I ran the same tests in RDS – with a local NVMe cache – against my decade-old Dell R620, with its disks also being NVMe, but via Ceph over Mellanox Infiniband. My server matched or beat the many-generations-newer RDS instance on almost every query.

You can’t get around latency. Even at 1 msec, that’s a maximum of 1000 ops a single thread can do per second, modulo the various buffering and chunking strategies every layer does.
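A minimal sketch of that latency bound, assuming a single thread that waits for each round trip before issuing the next one. The function name and the `ops_per_round_trip` knob (standing in for batching or chunking) are hypothetical, not taken from any particular driver.

```python
def max_ops_per_second(latency_us: int, ops_per_round_trip: int = 1) -> float:
    """Upper bound on throughput for one thread issuing strictly serial requests.

    latency_us: round-trip latency in microseconds.
    ops_per_round_trip: operations carried by each round trip (batching).
    """
    return ops_per_round_trip * 1_000_000 / latency_us


one_ms = max_ops_per_second(1000)          # 1 msec round trip, no batching
batched = max_ops_per_second(1000, 16)     # same latency, 16 ops per round trip
print(one_ms, batched)
```

At 1 msec per round trip the unbatched bound is 1000 ops per second; the only ways to beat it are lower latency, more operations per round trip, or more concurrent requests.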

bigbones · on July 17, 2024

> Meanwhile in AWS you would pay $27k a month to have the same IOPS as a Lenovo Thinkpad X1.

Curious about how this number was arrived at and what products were involved

bearjaws · on July 17, 2024

RDS with 256,000 IOPS (the max) on a single primary instance of MariaDB.

Of course it's not realistic to run a primary DB on a laptop, but the point was that IOPS are extremely expensive in AWS.

Offloading expensive queries to a browser is a viable solution. If I had decided to make this some sort of SaaS offering I am certain running full text search at scale would cost thousands of dollars per month long term, with the data becoming increasingly irrelevant over time, and I would still be forced to host it.

bearjaws · on July 18, 2024

It's now live at https://cluttr.ai

I'm going to write a few blog articles about some of the challenges I've had, and where I'd like to see local-first tooling mature.

resonious · on July 18, 2024

I wanted to try SQLite in the browser for a side project, but the startup time was unacceptable to me. It takes a solid second or two to crunch through the WASM binary, init SQLite, and open a database. Gets worse on first boot when you need to create all the tables.

There was a web.dev post or something saying "Web SQL is finally here! Just use SQLite with WASM!" Like, sure, that does seem to work, but it requires a huge WASM blob and a huge heap of JS glue. IndexedDB--as horrible as it is--starts up and is usable almost instantly. The sad part is that IndexedDB's querying is so bad you're forced to build your own database on top of it anyway...

I get where the web standards folks were coming from when they didn't like WebSQL just being SQLite in every browser, but frankly that would be so much better than what we have now, even 5 years after backpedaling.

lenkite · on July 18, 2024

SQLite is not SQL standards compliant. It doesn't even support data types from core SQL-92. One needs more than INTEGER, REAL, TEXT, BLOB for applications. Adopting it as WebSQL would have been a terrible mistake.
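For illustration, here is a small sketch of how SQLite treats declared column types in an ordinary (non-STRICT) table, using Python's built-in `sqlite3` module. The `events` table and its columns are made up for the example. Declared types such as DATE or DECIMAL are accepted, but each stored value ends up in one of SQLite's few storage classes, and a value that does not fit the declared type is kept as-is rather than rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the declared types only set a column "affinity".
conn.execute(
    "CREATE TABLE events (id INTEGER, happened_on DATE, amount DECIMAL(10,2))"
)

# A non-numeric string goes into the INTEGER column without an error.
conn.execute("INSERT INTO events VALUES ('not a number', '2024-07-18', '19.99')")

# typeof() reports the storage class actually used for each value.
storage_classes = conn.execute(
    "SELECT typeof(id), typeof(happened_on), typeof(amount) FROM events"
).fetchone()
print(storage_classes)
conn.close()
```

The date is stored as plain text and the decimal string is converted to a floating-point REAL, so there is no dedicated date or fixed-point decimal storage behind those declarations.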

AntonCTO · on July 18, 2024

Is IndexedDB SQL standards compliant? Inventing IndexedDB was a terrible mistake.

lenkite · on July 19, 2024

Fully agreed. Let's not compound a mistake with another mistake.

resonious · on July 19, 2024

Fair enough. But if I could pick just one mistake, I'd pick SQLite.

WuxiFingerHold · on July 18, 2024

A central Postgres (RDS) and a local SQLite are for pretty different use cases. You can't compare them.

CyberDildonics · on July 18, 2024

On a very slow SSD SQLite can query a 20 gigabyte database file in milliseconds.
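A plausible reason lookups like that stay fast is that an indexed query only touches a handful of B-tree pages regardless of file size. The sketch below is not a 20 GB benchmark; it uses a small in-memory table with made-up names (`images`, `idx_images_label`) and just asks SQLite for its query plan via Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, label TEXT)")

# Populate a hypothetical table with 100,000 rows.
conn.executemany(
    "INSERT INTO images (label) VALUES (?)",
    ((f"label-{i}",) for i in range(100_000)),
)
conn.execute("CREATE INDEX idx_images_label ON images(label)")

query = "SELECT id FROM images WHERE label = ?"

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("label-4242",)).fetchall()
detail = plan[0][3]

row = conn.execute(query, ("label-4242",)).fetchone()
print(detail)
print(row)
conn.close()
```

The plan detail should describe a SEARCH using `idx_images_label` rather than a full table SCAN; the exact wording varies between SQLite versions.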