
I hope this is not too tangential, but I have been thinking about the best ways to make use of direct access to non-volatile memory: skipping the block and driver layers and laying out data directly in your code. I suspect your project is one that could take that direction very usefully. A minimal sketch of what I mean follows below.
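
To make the suggestion concrete, here is a rough sketch in Python of byte-addressable access to persistent memory through a DAX-mounted file, where loads and stores bypass the page cache and the block layer. The path and region size are hypothetical, and this assumes a file system mounted with -o dax on real persistent memory:

    import mmap
    import os

    PMEM_PATH = "/mnt/pmem/app.region"   # hypothetical DAX-mounted file
    SIZE = 1 << 20                       # 1 MiB region, chosen arbitrarily

    fd = os.open(PMEM_PATH, os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, SIZE)

    # With DAX, reads and writes through this mapping go to the
    # persistent media directly; the application owns the layout
    # and the ordering of updates.
    buf = mmap.mmap(fd, SIZE)
    buf[0:5] = b"hello"
    buf.flush()          # msync: request durability for the dirty range
    buf.close()
    os.close(fd)

The point is that once the mapping exists, "storage" is just memory the application arranges itself, which is where the tuning and policy decisions I describe below would live.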

I would very much like to hear your reaction to this suggestion, and I would also like to ask whether you have looked at the work at weka.io for their take on the convergence of storage. I increasingly like the idea of using the hardware directly and moving management to the application level, where the developer can apply intelligent measures and policies to tune the system separately from the OS. Close cooperation would let the collected data become a valuable resource for administrators and directly benefit the pace of production development by providing a comprehensive, universal instrumentation context.

I know this comes close to rehashing the argument for why databases should use their own on-disk format, and there is a long history of the trade-offs involved there.

But the trouble I have with the current storage space is that the investment tied up in file systems is not flexible enough for the kind of smaller mixed deployments I come into contact with at the lower end of the small-business market. Ceph, for example, or any monolithic FS, is an investment whose most constrained resource turns out to be management time, and additional file systems are a difficult proposition for small shops. The idea I'm looking for is that the application layer should be responsible for storage management and performance tuning, following best practices set by the software publisher and learned from collected instrumentation data.

I think the end of the general-purpose operating system is nigh, or arguably already came last century.

Nobody is able to use large software programs in a turnkey way while making assumptions about the OS environment. I'm hand-waving plenty in saying that, but I can flippantly add that a friend's experience providing contracted management to small businesses is not atypical in my experience: he joked that he loves Linux because it means he gets a clean install and nobody around is likely to know how to mess it up.

In the contexts where I see very little leveraging of OS capabilities, particularly in the Windows Server world, it looks like a lot of wasted effort and license expense.

I beg forgiveness in advance for this facetious illustration, but in conversation with a small-business web developer recently, I cited the example of Plenty of Fish and rhetorically asked if he knew it was a one-man gig, on Windows and IIS. He was unaware of this, so I teased him that he would be forever stuck in his first money rounds and hiring if he had taken a similar bet on building such a dating site while reading HN so much... My joke is off colour, sorry, but I wanted then, as now, to make the point that it has become all too accepted to automatically start with a complete development stack and seek advantages in the customization and deep power that come from leveraging highly experienced professionals. I worry about whether we all just do too much of this, and it's time to review the situation more broadly than my rotten humour alludes to, because the problem, if it is a problem, is much wider.



I agree wholeheartedly with this statement: "application layer should be responsible for storage management and performance tuning". I would take it one step further and say that storage should be virtualized in a high-performance way. HDF5 was the old way, Parquet/Avro is the new way, and something like Apache Arrow is the future. We are currently focused on efficient, cross-platform friendly ways of serializing [columnar] data and have chosen Parquet. Optimizing to the level of volatile caches, though, is probably not something we're ready to tackle. The performance gains to be had by eliminating parsing and lazily loading data (in the spirit of DMA) are absolutely huge. And good file-formats accomplish that. Moreover, the amount of time and performance lost to moving data around is staggering. https://weld-project.github.io/ and https://arrow.apache.org/ sketch the solution: 1) optimize the entire computation graph to minimize data materialization; 2) have a canonical in-memory representation that can quickly serialize results to a variety of clients.
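
To make that concrete, here is a minimal sketch using pyarrow (assuming pyarrow is installed; the file and column names are made up): it writes Parquet, projects columns at read time so untouched data is never parsed, and memory-maps an Arrow IPC file so readers reference the buffers in place, which is the "in the spirit of DMA" point above.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small table in Arrow's canonical in-memory columnar format.
    table = pa.table({
        "user_id": [1, 2, 3],
        "score": [0.9, 0.7, 0.4],
        "country": ["US", "DE", "JP"],
    })

    # Parquet for durable storage: columnar, compressed, no text parsing.
    pq.write_table(table, "events.parquet")

    # Column projection at read: only the requested columns are decoded,
    # so unrelated data is never materialized.
    subset = pq.read_table("events.parquet", columns=["user_id", "score"])

    # Arrow IPC plus mmap for zero-copy hand-off between processes.
    with pa.OSFile("events.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    with pa.memory_map("events.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
        print(mapped.column("score"))

The same in-memory representation serves both paths, which is what lets results move to a variety of clients without a serialization round-trip.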

I have not heard of weka.io but will take a look.




