Hacker News | tanelpoder's comments

Optane memory modules also present themselves as separate (memory only) NUMA nodes. They’ve given me a chance to play with Linux tiered memory, without having to emulate the hardware for a VM


If this link doesn't work, just go to https://google.com/ai - it redirects to https://www.google.com/search?udm=50&aep=11 (HN apparently strips the udm= field from the link).

This URL allows you to jump straight to the AI chat mode, without having to search for something first.


Yup, for best results you wouldn't just dump your existing pointer-chasing and linked-list data structures onto CXL memory (the way Optane's transparent "Memory Mode" allowed).

But CXL-backed memory goes through your CPU caches as usual, and PCIe 5.0 lane throughput is still good, assuming the CXL controller/DRAM side doesn't become a bottleneck. So you could design your engines and data structures to account for these tradeoffs: fetching/scanning columnar data structures, prefetching to hide latency, etc. You probably don't want global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible, in theory, with CXL 3.0).
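
A loose illustration of the access-pattern point (plain Python, all names mine): a contiguous columnar scan touches memory sequentially, so hardware prefetchers can hide the latency of a slower tier like CXL-attached DRAM, while a linked list issues one dependent load per element:

```python
# Sketch: sequential (columnar) scan vs. pointer chasing.
# The memory layout, not Python's speed, is the point here:
# prefetchers can hide latency for the sequential scan, but
# each linked-list hop must wait for the previous load.

from array import array

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def columnar_sum(col: array) -> int:
    # Contiguous buffer: the next element's address is predictable.
    return sum(col)

def linked_sum(head: Node) -> int:
    # Each iteration depends on the pointer loaded in the previous one.
    total, node = 0, head
    while node is not None:
        total += node.value
        node = node.next
    return total

values = list(range(1000))
col = array("q", values)          # 64-bit ints, densely packed
head = None
for v in reversed(values):        # same data as a linked list
    head = Node(v, head)

assert columnar_sum(col) == linked_sum(head) == sum(values)
```

Both functions compute the same result; only the traversal pattern (and thus prefetch-friendliness on a high-latency tier) differs.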

Edit: I'll plug my own article here - if you've wondered whether there were actual large-scale commercial products that used Intel's Optane as intended: the Oracle database took good advantage of it (both the Exadata and plain database engines). One use was low-latency durable (local) commits on Optane:

https://tanelpoder.com/posts/testing-oracles-use-of-optane-p...

VMware supports it as well, but uses it as a simpler tiered-memory layer.


> You probably don't want to have global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible in theory with CXL3.0).

I'd bet contended locks spend more time in cache than most other memory lines, so in practice a global lock might not be too bad.


Yep, agreed for single-host CXL scenarios. I wrote this comment thinking about a hypothetical future CXL 3.x+ scenario with multi-host fabric coherence, where you could in theory put the locks and control structures that protect shared access to CXL memory pools into that same shared CXL memory (so no need for coordination over the regular network, at least).


DBMSs have been managing storage with different access times for decades and it should be pretty easy to adapt an existing engine. Or you could use it as a gigantic swap space. No clue whether additional kernel patches would be required for that.
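
A toy sketch of that adaptation (all names mine): a buffer manager keeps hot pages in a small fast tier and demotes least-recently-used pages to a slower tier, the same way an engine could treat local DRAM vs. CXL-attached memory:

```python
# Toy two-tier buffer pool: hot pages in the "fast" tier (think
# local DRAM), LRU victims demoted to the "slow" tier (think
# CXL-attached memory, or disk). Illustrative only.

from collections import OrderedDict

class TieredPool:
    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # page_id -> data, in LRU order
        self.slow = {}              # demoted pages
        self.fast_capacity = fast_capacity

    def get(self, page_id):
        if page_id in self.fast:
            self.fast.move_to_end(page_id)      # refresh LRU position
            return self.fast[page_id]
        data = self.slow.pop(page_id)           # slow access: promote it
        self._admit(page_id, data)
        return data

    def put(self, page_id, data):
        self._admit(page_id, data)

    def _admit(self, page_id, data):
        self.fast[page_id] = data
        self.fast.move_to_end(page_id)
        if len(self.fast) > self.fast_capacity:  # demote the coldest page
            victim, vdata = self.fast.popitem(last=False)
            self.slow[victim] = vdata

pool = TieredPool(fast_capacity=2)
pool.put(1, "a"); pool.put(2, "b"); pool.put(3, "c")  # page 1 demoted
assert 1 in pool.slow and 3 in pool.fast
assert pool.get(1) == "a"                             # promoted back
assert 1 in pool.fast
```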


Yeah, I saw the same. I've been keeping an eye on the CXL world for ~5 years and so far it's 99% announcements, unveilings and great predictions. The only CXL cards a consumer/small business can actually buy today are some experimental-ish 64 GB/128 GB ones, and I haven't seen any of my larger clients use CXL either. Both the Intel Optane and DSSD storage efforts were discontinued after years of fanfare; from a technical point of view, I hope the same doesn't happen to CXL.


I think Meta has already rolled out some CXL hardware for memory tiering. Marvell, Samsung, Xconn and many others have built various memory chips and switching hardware up to CXL 3.0. All recent Intel and AMD CPUs support CXL.


... and if you have the money, you can use 3 out of 4 PCIe5 slots for CXL expansion. So that could be 2TB DRAM + 1.5TB DRAM-over-CXL, all cache coherent thanks to CXL.mem.

I guess there are some single-host use cases for this, but I think the biggest wins could come from CXL shared memory arrays in smaller clusters. You could, for example, cache the entire build side of a big hash join in the shared CXL memory and let all the other nodes performing the join see the single shared dataset. Or build a "coherent global buffer cache" using CPU+PCIe+CXL hardware, like Oracle Real Application Clusters has been doing in software with NICs for the last 30 years.
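
A sketch of the hash-join point (plain Python, single process standing in for the cluster): the build side is hashed once into a single shared structure, and every probe "node" reads that same copy instead of each building its own:

```python
# Sketch: the build side of a hash join materialized once and
# shared by all probers -- the role a shared CXL memory pool
# could play for multiple hosts. Plain dicts stand in for it.

def build_hash_table(build_rows):
    # Built once; in the CXL scenario this would live in the
    # shared, cache-coherent memory pool.
    table = {}
    for key, payload in build_rows:
        table.setdefault(key, []).append(payload)
    return table

def probe(shared_table, probe_rows):
    # Each "node" probes the single shared build side.
    return [(key, p, b)
            for key, p in probe_rows
            for b in shared_table.get(key, ())]

build_side = [(1, "dim-a"), (2, "dim-b")]
shared = build_hash_table(build_side)

# Two probe partitions, e.g. handled by two different hosts:
part1 = probe(shared, [(1, "fact-1"), (3, "fact-3")])
part2 = probe(shared, [(2, "fact-2")])
assert part1 == [(1, "fact-1", "dim-a")]
assert part2 == [(2, "fact-2", "dim-b")]
```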

Edit: One example of a CXL shared memory pool device is Samsung's CMM-B. Still just an announcement - I haven't seen it in the wild. So CXL arrays might become something like the SAN arrays of the future, but byte-addressable and with direct, cache-coherent loads into the CPU cache:

https://semiconductor.samsung.com/news-events/tech-blog/cxl-...


Yes, can confirm, the book is great. I was also happy to see that the author correctly (in my mind) used the term “embedding vectors” vs. “vector embeddings” that most others seem to use… Some more context about my pet peeve: https://tanelpoder.com/posts/embedding-vectors-vs-vector-emb...


When I'm on the fence about some (technical) decision, I use a "razor": if all options seem equal, go with the simplest one. The results have been ok so far, and it has greatly reduced the brain energy I spend on pontification and on optimizing too far ahead.

I liked the post, but these kinds of articles make sense to people who've already been through the trenches, who can view the advice from their seasoned-experience PoV and apply it accordingly. If people without such experience follow it to the letter just because it's written down, they can have surprises ahead.


I haven't tested Intel's efficiency cores (E-cores) myself - would these address the need for desktops/laptops?


Apple and many ARM mobile platforms also use a mix of performance and efficiency cores, so it seems to be a proven approach. I guess it comes down to implementation. Intel's efficiency cores by themselves (e.g. the N series) apparently make nice little appliances, often better value than something like a RPi. I don't know how much they help their higher-performance chips conserve energy.

I have one of Intel's old desktop class processors in a refurbished ex-office mini-desktop plugged into a power meter running a few services for the household and the idle usage isn't terrible. I don't understand why my laptop doesn't run colder and longer given the years of development between them.

There is also the race-to-idle strategy, where you spike the performance cores to get back to idle sooner, which probably works well for a lot of office usage but not so well for something more demanding like games or benchmarks.


Also, the latest Oracle version (23ai) added "concurrent on-commit fast refresh" functionality, where concurrent transactions' changes are rolled up into the MV concurrently (previously these refreshes were serialized).

https://oracle-base.com/articles/23/materialized-view-concur...

From the article: "Concurrent refreshes will only happen if there are no overlaps in the materialized view rows updated by each session."
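
As a rough sketch of why that non-overlap condition matters (not Oracle's implementation; all names mine): if two transactions' deltas touch disjoint MV rows, they can be merged in parallel, e.g. under per-key locks, without serializing the whole refresh:

```python
# Rough sketch of concurrent incremental MV refresh: per-key
# locking lets transactions that touch disjoint MV rows apply
# their deltas in parallel. Not Oracle's actual implementation.

import threading
from collections import defaultdict

mv = defaultdict(int)                     # MV: key -> SUM aggregate
key_locks = defaultdict(threading.Lock)   # one lock per MV row

def on_commit_refresh(delta):
    # delta: {key: amount} produced by one committing transaction.
    # Transactions with disjoint key sets never contend with each other.
    for key, amount in sorted(delta.items()):  # sorted order avoids deadlock
        with key_locks[key]:
            mv[key] += amount

t1 = threading.Thread(target=on_commit_refresh, args=({"a": 5, "b": 1},))
t2 = threading.Thread(target=on_commit_refresh, args=({"c": 7},))  # disjoint
t1.start(); t2.start(); t1.join(); t2.join()
assert dict(mv) == {"a": 5, "b": 1, "c": 7}
```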


In the database-nerd world, we had something like this ~10 years ago, written by @flashdba. Still a good read:

https://flashdba.com/category/storage-for-dbas/understanding...

