Hacker News | tanelpoder's comments

After 8 years, worth reposting... Previous discussion here:

https://news.ycombinator.com/item?id=15474043


Yep, same here, first GUI editor that has ended up sticky enough. I've used just plain vim (or vi, back in the day) in a terminal. Occasionally tried out some GUI editors every few years, but none of them were suitable/sticky enough for my taste. Now I've been using Zed for 8-9 months, almost daily when coding vs. just testing/hacking something, and haven't fallen back to vim in a terminal for coding tasks yet.


The reluctance to use stored procedures where they’d be valuable is also a skill + will issue. I do get the non-database-developer view: if 98% of your app is gonna be written in some other language anyway, why complicate your app layers by adding a bit of PL/SQL code + testing infra here and there.

But for processing (and validating) various data loads and incoming data streams, by looking up matching values in existing reference tables, stored procedures can increase performance/efficiency and reduce complexity (of having to maintain connection pools and caching layers just for validating incoming record batches).

As far as the unwillingness to even evaluate stored procedures goes, I sometimes joke that maybe the DB vendors should start calling stored procedures “database lambdas” instead…
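To make the validation point concrete: the kind of lookup-validation a stored procedure would do server-side can be sketched as a single set-based statement. This is an illustration only, using in-process SQLite so it's self-contained; the table and column names are made up, and in Oracle/Postgres this SELECT would live inside a PL/SQL or PL/pgSQL procedure next to the data.

```python
import sqlite3

# Validate an incoming batch against an existing reference table with one
# set-based statement, instead of one client round trip per record.
# Hypothetical schema for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ref_country (code TEXT PRIMARY KEY);
    INSERT INTO ref_country VALUES ('EE'), ('US'), ('DE');
    CREATE TABLE staging_orders (order_id INTEGER, country_code TEXT);
    INSERT INTO staging_orders VALUES (1, 'EE'), (2, 'XX'), (3, 'US');
""")
# One statement flags every record whose reference lookup failed.
bad = conn.execute("""
    SELECT s.order_id
    FROM staging_orders s
    LEFT JOIN ref_country r ON r.code = s.country_code
    WHERE r.code IS NULL
""").fetchall()
print(bad)  # order_ids that failed reference-data validation
```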


> But for processing (and validating) various data loads and incoming data streams, by looking up matching values in existing reference tables, stored procedures can increase performance/efficiency and reduce complexity

Performing operations on data directly in the SQL provider is the peak of human enlightenment. It takes a lot of leadership or wisdom to push a modern team away from using crap like EF to process everything, 100x slower than the database itself could.

In exactly 0% of cases of cycling records through code will you see a higher performance result than executing a single T-SQL or PL/SQL script against the provider.

The procedural SQL languages are Turing complete. SQL itself is as well, as of recursive common table expressions. There's no reasonable argument for not trying this if all the information is already in the SQL store.
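For anyone who hasn't seen it: a recursive CTE expresses iterative computation directly in SQL. A minimal sketch, using SQLite from Python only to keep it self-contained (the same WITH RECURSIVE works in Postgres, SQL Server, Oracle, etc.):

```python
import sqlite3

# Iterative computation in plain SQL: a recursive CTE that walks the
# Fibonacci sequence up to the 10th term.
conn = sqlite3.connect(":memory:")
(fib10,) = conn.execute("""
    WITH RECURSIVE fib(n, a, b) AS (
        SELECT 1, 0, 1
        UNION ALL
        SELECT n + 1, b, a + b FROM fib WHERE n < 10
    )
    SELECT b FROM fib WHERE n = 10
""").fetchone()
print(fib10)
```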

Moving information is way more expensive than processing information that is already in place (in cache). Your SQL server process can iterate on items in L1 cache millions of times before a packet makes it across the data center once.
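The row-at-a-time vs. set-based contrast can be sketched in a few lines. In-process SQLite has no network hop, so this does not measure the real-world gap (per-row round trips across a data center); it only shows that the set-based form does the same work in one statement instead of thousands:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(1000)])

# Row-at-a-time: fetch every row, compute in client code, write each back.
for (rowid, val) in conn.execute("SELECT id, val FROM t").fetchall():
    conn.execute("UPDATE t SET val = ? WHERE id = ?", (val * 2, rowid))

# Set-based: one statement, evaluated entirely inside the engine.
conn.execute("UPDATE t SET val = val / 2")

total = conn.execute("SELECT SUM(val) FROM t").fetchone()[0]
print(total)  # both approaches leave the table in the same state
```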


Operational concerns trump raw performance most of the time. Stored procedures live in a different CI/CD environment, with a different testing framework (if there’s even one), on a different deployment lifecycle, using a different language than my main code. It is also essentially an un-pinnable dependency. Too much pain for the gain.

Now, give me ephemeral, per-connection procedures (call them unstored procedures for fun) that I can write in the language I want but that run on provider side, sure I’ll happily use them.


> Stored procedures live in a different CI/CD environment

They don't have to. The procedural SQL commands can be source controlled along with the rest of the codebase. Transmitting the actual command text to the SQL server that does all the work is not the inefficient part.


Why do they live in separate CI?


Perhaps the grand vision was to later use all those newly built internal tools as reusable components in their customer-facing apps…


Yeah, I feel like if their goal was to make custom software cheap and easy, this was basically just dogfooding.


I would assume they were collecting data. If you're building a chatbot to talk to clients, you need all the transcripts from meetings, chat logs, project info, analytics, metrics, and so on. It's the only way to train your model properly.

Do they have any patents related to using chatbots for project management?


Probably, but casting such a wide net so early has its risks - as is visible in hindsight. Build one, then use that to build the second, the first two to build the third, etc, etc.

Doing everything all at once is a recipe for disaster.


Most likely. But that's exactly what someone who hasn't experienced enough cycles in industry would come up with :).


Exactly this. My startup count is zero, but even I’m well enough informed to know that casting such a wide net is diverting attention away into too many tangents, too quickly, and will drain your balance sheet dry before you can capitalize on any singular success to build a moat or secure another funding round.

They were banking on AI coding being better than it was, and the snowball effect happening faster than it ever could. And now, they’re toast.


Yeah, I just wanted a cool-sounding name for this. Nevertheless, it allows you to do easy stress-testing with some vector search operations (a quite narrow set, but you can combine it with joins and write your own queries if you like). But "CatStress" didn't sound too good to me.

It's a "Vector Search Playground" really, but the bigger value so far has come not from running maximum stress tests, but from demonstrating to people how you can join vector search results to the rest of your (existing) application schema. Plenty of people have thought that you need a completely separate, isolated vector store behind some API for this...

Edit: Also the setup part includes running a "generate_embeddings.py" script that uses PyTorch under the hood (on CPUs or CUDA/GPUs) to generate embeddings from the 25k photos (or 9M when using the rotated variants). That process can also be sped up and optimized for sure - my whole point is that once everything runs OK enough from end to end, then it's time to start measuring and optimizing the whole process - for learning and fun.


I took a completely different path to visualizing embedding vectors’ physical layout [1] - mainly to explain how the data structures, data volumes and comparison work would radically differ from your regular btree index searches. I made sure to mention that you can’t draw any conclusions from just human eyeballing of these vector heatmaps, but the database people I’ve demoed this to seem to have reached some a-ha moments about how radically different a vector search is compared to the usual database index lookup work:

[1] https://tanelpoder.com/posts/comparing-vectors-of-the-same-r...


I'll see if I have a chance to run such a test on AWS in the coming days (and I'd need to keep running it for much longer than just the 5 seconds shown in the blog).

If you care about WAL write/commit latency, you could provision a small-ish EBS io2 Block Express device (with provisioned IOPS) just for your WAL files and the rest of your data can still reside on cheaper EBS storage. And you might not even need to hugely overprovision your WAL device IOPS (as databases can batch commit writes for multiple transactions).

But the main point is that once your WAL files are on a completely separate block device from all the other datafile I/O, they won't suffer from the various read & write IO bursts that can happen during regular database activity. On Oracle databases, I put controlfiles on these separate devices too, as they are on the critical path during redo log switches...


How do you backup those split volumes? I like EBS volume snapshots, since they're atomic, incremental, fast to restore and make it easy to spin up a clone. But obviously that approach won't work for split volumes.


Yep indeed, that's a tradeoff if your DB fits into a single volume. I'm not that deeply familiar with databases other than Oracle (which has its own ways to work around this), so for ease of use, keeping everything on a single volume keeps things simpler.

One thing I try to achieve anyway is to spread & "smoothen" the database checkpoint & fsync activity over time via database checkpointing parameters, so you won't have huge "IO storms" every 15 minutes, but just steady writing of dirty buffers going on all the time. So, even if all your files are stored on the same block device, you're less likely to see a case where your WAL writes wait behind 50,000 checkpoint write requests issued just before.
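In PostgreSQL, for example, this smoothing mostly comes down to a couple of checkpointer settings. A hedged sketch (this is a config fragment, and the values are illustrative only - tune them against your own workload and recovery-time requirements):

```sql
-- Spread checkpoint writes over (most of) the checkpoint interval
-- instead of bursting them all at once.
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- write dirty buffers over ~90% of the interval
ALTER SYSTEM SET max_wal_size = '8GB';                -- avoid extra checkpoints triggered by WAL volume
SELECT pg_reload_conf();
```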


Well, likely you can just use those as ZFS with a ZIL, or in a more traditional setup with an LVM writeback cache - which from my experience greatly improves latency.


Throughput with enough I/O concurrency, yes. That's actually why I wrote this blog entry, just to bring attention to this - nice IOPS numbers do not translate to nice individual I/O latency numbers. If an individual WAL write takes ~1.5 ms (instead of tens of microseconds), this means that your app transactions also take 1.5+ ms and not sub-millisecond. Not everyone cares about this (and often doesn't even need to), but it's worth being aware of.

I tend to set up a small but completely separate block device (usually on enterprise SAN storage or a cloud block store) just for WAL/redo logs, to have a different device with its own queue for that. So when that big database checkpoint or fsync happens against datafiles, the thousands of concurrently submitted IO requests won't get in the way of WAL writes that still need to complete fast. I've done something similar in the past with separate filesystem journal devices too (for niche use cases...)

Edit: Another use case for this is that ZFS users can put the ZIL on low-latency devices, while keeping the main storage on lower cost devices.


> I tend to set up a small, but completely separate block device (usually on enterprise SAN storage or cloud block store) just for WAL/redo logs

I'm not sure about this, as this separate device may handle more of the total (aggregated) work by being a member of a unique pool (a RAID made of all available non-spare devices) used by the PostgreSQL server.

It seems to me that in most cases the most efficient setup, even when trying hard to reduce the maximum latency (and therefore sacrificing some throughput), is a unique pool AND adequate I/O scheduling enforcing a "max latency" parameter.

If, during peaks of activity, your WAL-dedicated device isn't permanently at 100% usage while the data pool is, then dedicating it may (overall) bump up the max latency and reduce throughput.

Tweaking some parameters (bgwriter, full_page_writes, wal_compression, wal_writer_delay, max_wal_senders, wal_level, wal_buffers, wal_init_zero...) with respect to the usage profile (max tolerated latency, OLTP, OLAP, proportion of SELECTs and INSERTs/UPDATEs, I/O subsystem characteristics and performance, kernel parameters...) is key.


When doing 1M+ IOPS, you probably do not want to use OS IO schedulers due to the OS (timer & spinlock) overhead [1] and let the hardware take care of any scheduling in their device queues. But you're right about flattening the IO burst spikes via DB configuration, so that you'd have constant slow checkpointing going on, instead of a huge spike every 15 minutes...

All this depends on what kind of storage backend you're on: local consumer SSDs with just one NVMe namespace each, or local SSDs with multiple namespaces (with their own queues), or a full-blown enterprise storage backend where you have no idea what's really going on in the backend :-)

[1]: https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...

Edit: Note that I wasn't proposing using an entire physical disk device (or multiple) for the low latency files, but just a part of it. Local enterprise-grade SSDs support multiple namespaces (with their own internal queues) so you can carve out just 1% of that for separate I/O processing. And with enterprise SAN arrays (or cloud elastic block store offerings) this works too, you don't know how many physical disks are involved in the backend anyway, but at your host OS level, you get a separate IO queue that is not gonna be full of thousands of checkpoint writes.


> local enterprise-grade SSDs support multiple namespaces (with their own internal queues)

What do you mean by namespaces here? Are they created by having different partitions or LVM volumes? As you mentioned consumer grade SSDs only have a single namespace, I am guessing this is something that needs some config when mounting the drive?


With SSDs that support namespaces you can use commands like "nvme create-ns" to create logical "partitioning" of the underlying device, so you'll end up with device names like this (also in my blog above):

  /dev/nvme0n1  /dev/nvme0n2  /dev/nvme0n3  ...

Consumer disks support only a single namespace, as far as I've seen. Different namespaces give you extra flexibility (I think some even support different sector sizes for different namespaces).

So under the hood you'd still be using the same NAND storage, but the controller can now process incoming I/Os with awareness of which "logical device" they came from. So, even if your data volume has managed to submit a burst of 1000 in-flight I/O requests via its namespace, the controller can still pick some latest I/Os from other (redo volume) namespaces to be served as well (without having to serve the other burst of I/Os first).

So, you can create a high-priority queue by using multiple namespaces on the same device. It's like logical partitioning of the SSD device's I/O handling capability, not physical partitioning of disk space like OS "fdisk" level partitioning would be. The OS "fdisk" partitioning or LVM mapping is not related to NVMe namespaces at all.
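For reference, the nvme-cli invocation looks roughly like this. This is a hedged ops sketch, not something to paste blindly: the sizes (in 512-byte blocks), namespace ID, and controller ID are illustrative only and depend on your drive - check what it supports with "nvme id-ctrl /dev/nvme0" first, and note that create-ns/attach-ns will destroy data if misused.

```shell
# Carve a small extra namespace out of an SSD that supports multiple
# namespaces (run as root; values are examples only).
nvme create-ns /dev/nvme0 --nsze=20971520 --ncap=20971520 --flbas=0  # ~10 GiB
nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0x1         # appears as /dev/nvme0n2
nvme ns-rescan /dev/nvme0
nvme list-ns /dev/nvme0
```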

Also, I'm not a NVMe SSD expert, but this is my understanding and my test results agree so far.


Ah ok - so googling a bit on this, you do specify the size when creating the namespace. So if you have multiple namespaces, they appear as separate devices to the OS, and then you can mkfs and mount each as if it's a different disk. Then you get the different IO queues at the hardware level, unlike with traditional partitioning.


Yep, exactly - with OS level partitioning or logical volumes, you'd still end up with a single underlying block device (and a single queue) at the end of the day.


I happen to be an Estonian, but this comment is general: I wonder whether teaching kids to be AI-tool-literate would also help teach some critical thinking and reading skills for consuming information from social media and the Internet.

Since AI does hallucinate, you have to be able to dig deeper into the sources of facts when needed, to make sure you're not being fed hallucinated lies. The same skills ought to apply to critical thinking/internet news literacy, as that loud "news" channel or TikTok influencer might also be "hallucinating" lies to you for their own reasons...


I bought some Intel Optane persistent memory modules some years back to run some tests on them and maybe even build a PMEM-based KV store. I even wrote some articles about Oracle database's usage of PMEM for low-latency (local) commits [1]. But soon after I bought the memory, Intel discontinued the entire product line. I'm guessing that there just wasn't enough interest in practice as there were plenty of other (software-based) alternatives and NVMe SSDs got really fast too.

I've seen grand announcements, partnerships & alliances, and unveilings of exciting CXL products that you could not actually order, for the last 6 years by now. Or some CXL modules that add only 64 GB of CXL DRAM while taking up 8x PCIe5 lanes. Perhaps there's lots of testing action going on with select hyperscalers, but it's slightly worrying that after 6 years of announcements by major vendors (Samsung, SK Hynix, etc), you still can't just order a couple of these devices for your own testing.

I think shared, external CXL memory arrays that are connected to multiple servers (like SAN storage, but for cache-coherent, directly CPU-connected memory) would enable a few very interesting use cases for database engines:

1) Shared, cache-coherent global buffer caches for small-to-medium sized DB clusters

2) Shared build-side of giant hash joins for SQL (so you don't need to shuffle, repartition or broadcast data for your large joins)

Let's hope that this technology is real enough to become mainstream and actually available to users.

[1]: https://tanelpoder.com/posts/testing-oracles-use-of-optane-p...

