
I hope this is not too tangential, but I have been thinking about the best ways to make use of direct access to non-volatile memory: skipping the block and driver layers and laying out data directly in your code. I suspect your project is one that could take that direction very usefully. A minimal sketch of what I mean follows below.
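
To make the suggestion concrete, here is a rough sketch in Python of byte-addressable access to persistent memory through a DAX-mounted file, where loads and stores bypass the page cache and the block layer. The path and region size are hypothetical, and this assumes a file system mounted with -o dax on real persistent memory:

    import mmap
    import os

    PMEM_PATH = "/mnt/pmem/app.region"   # hypothetical DAX-mounted file
    SIZE = 1 << 20                       # 1 MiB region, chosen arbitrarily

    fd = os.open(PMEM_PATH, os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, SIZE)

    # With DAX, reads and writes through this mapping go to the
    # persistent media directly; the application owns the layout
    # and the ordering of updates.
    buf = mmap.mmap(fd, SIZE)
    buf[0:5] = b"hello"
    buf.flush()          # msync: request durability for the dirty range
    buf.close()
    os.close(fd)

The point is that once the mapping exists, "storage" is just memory the application arranges itself, which is where the tuning and policy decisions I describe below would live.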

I would very much like to hear your reaction to this suggestion, and I would also like to ask whether you have looked at the work at weka.io for their take on the convergence of storage. I increasingly like the idea of using the hardware directly and moving management to the application level, where the developer can apply intelligent measures and policies to tune the system separately from the OS. Close cooperation would let the collected data become a valuable resource for administrators and directly benefit the pace of production development by providing a comprehensive, universal instrumentation context.

I know this comes close to rehashing the argument for why databases should use their own on-disk format, and there is a long history of the trade-offs involved there.

But the trouble I have with the current storage space is that the investment tied up in file systems is not flexible enough for the kind of smaller mixed deployments I come into contact with at the lower end of the small-business market. Ceph, for example, or any monolithic FS, is an investment whose most constrained resource turns out to be management time, and additional file systems are a difficult proposition for small shops. The idea I'm looking for is that the application layer should be responsible for storage management and performance tuning, following best practices set by the software publisher and learned from collected instrumentation data.

I think the end of the general-purpose operating system is nigh, or arguably already came last century.

Nobody is able to use large software programs in a turnkey way while making assumptions about the OS environment. I'm hand-waving plenty in saying that, but I can flippantly add that a friend's experience providing contracted management to small businesses is not atypical in my experience: he joked that he loves Linux because it means he gets a clean install and nobody around is likely to know how to mess it up.

In the contexts where I see very little leveraging of OS capabilities, particularly in the Windows Server world, it looks like a lot of wasted effort and license expense.

I beg forgiveness in advance for this facetious illustration, but in conversation with a small-business web developer recently, I cited the example of Plenty of Fish and rhetorically asked if he knew it was a one-man gig, on Windows and IIS. He was unaware of this, so I teased him that he would be forever stuck in his first money rounds and hiring if he had taken a similar bet on building such a dating site while reading HN so much... My joke is off colour, sorry, but I wanted then, as now, to make the point that it has become all too accepted to automatically start with a complete development stack and seek advantages in the customization and deep power that come from leveraging highly experienced professionals. I worry about whether we all just do too much of this, and it's time to review the situation more broadly than my rotten humour alludes to, because the problem, if it is a problem, is much wider.



I agree wholeheartedly with this statement: "application layer should be responsible for storage management and performance tuning". I would take it one step further and say that storage should be virtualized in a high-performance way. HDF5 was the old way, Parquet/Avro is the new way, and something like Apache Arrow is the future. We are currently focused on efficient, cross-platform friendly ways of serializing [columnar] data and have chosen Parquet. Optimizing to the level of volatile caches, though, is probably not something we're ready to tackle. The performance gains to be had by eliminating parsing and lazily loading data (in the spirit of DMA) are absolutely huge. And good file-formats accomplish that. Moreover, the amount of time and performance lost to moving data around is staggering. https://weld-project.github.io/ and https://arrow.apache.org/ sketch the solution: 1) optimize the entire computation graph to minimize data materialization; 2) have a canonical in-memory representation that can quickly serialize results to a variety of clients.
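
To make that concrete, here is a minimal sketch using pyarrow (assuming pyarrow is installed; the file and column names are made up): it writes Parquet, projects columns at read time so untouched data is never parsed, and memory-maps an Arrow IPC file so readers reference the buffers in place, which is the "in the spirit of DMA" point above.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small table in Arrow's canonical in-memory columnar format.
    table = pa.table({
        "user_id": [1, 2, 3],
        "score": [0.9, 0.7, 0.4],
        "country": ["US", "DE", "JP"],
    })

    # Parquet for durable storage: columnar, compressed, no text parsing.
    pq.write_table(table, "events.parquet")

    # Column projection at read: only the requested columns are decoded,
    # so unrelated data is never materialized.
    subset = pq.read_table("events.parquet", columns=["user_id", "score"])

    # Arrow IPC plus mmap for zero-copy hand-off between processes.
    with pa.OSFile("events.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    with pa.memory_map("events.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
        print(mapped.column("score"))

The same in-memory representation serves both paths, which is what lets results move to a variety of clients without a serialization round-trip.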

I have not heard of weka.io but will take a look.




