Hacker News

Struct of arrays (called MultiArrayList in Zig): instead of storing big structs in an array, you store each field in a separate array, and if you need the full struct you reconstruct it from the arrays.

The benefit is that each array's items are smaller and carry no padding, and it also improves cache locality when you only touch some of the fields.




Now I remember reading a blog post recently, about a library (one of the Rust ones, maybe?) that was doing both: array-of-structs-of-arrays.

The idea was to split the main arrays up into “chunks” that were structs, themselves containing rows of data (still array oriented).

The result is that you retain the cache/layout/performance benefits of the SoA layout, except each chunk is already packaged up into lengths ready for pushing through SIMD. Quite a clever idea.

Found it: https://www.rustsim.org/blog/2020/03/23/simd-aosoa-in-nalgeb...


Sounds like a primitive of columnar databases.


Entity-Component Systems do exactly this for better cache locality.


this is a great way to store structured data in js since it saves the memory cost of having repeated keys. e.g.:

  // struct of arrays: one array per field
  records = {
    time: [1000, 1001],
    price: [20, 25],
    volume: [50, 15]
  }

  // vs. array of structs: keys repeated per record
  records = [
      { time: 1000, price: 20, volume: 50 },
      { time: 1001, price: 25, volume: 15 }
  ]
  // not a big difference with 2 records, but for xxxx records...


Decent JS engines will use "hidden classes" to dedupe keys for you already, so this isn't necessary to save space; the technique is pretty old and dates to Self. Still, the arrangement may help with locality of reference.


In practice, most js engines these days can ‘recognise’ the ‘class’ of these objects (if you create them the same way in a few places): the memory representation ends up as one word for the ‘class’, which says that time is at field 0, price at field 1, and volume at field 2, followed by the data itself. The main reason for this is to speed up code that reads the fields, rather than to save memory.


I remember BASIC in the 80s when all I had were arrays of integers or strings. To have complex data structures I used one array per field. I don't think the CPU had cache though (Z80) /grin


Wouldn’t that hurt locality? Since now you need to do multiple accesses across the entire heap to reconstruct one object.


It depends. For operations or aggregates on a single field, it improves cache locality, whereas if operations act on all, or most, fields of the struct, it hurts cache locality. The exact same tradeoff differentiates OLTP (transactional/row stored) databases and OLAP (analytics/column stored) databases.


You wouldn't typically use this when you need lots of random accesses, but when you process the data sequentially.

It also simplifies and speeds up SIMD loads and stores, because you can fill entire SIMD registers with contiguous, non-strided memory accesses.





