
Microsoft has had really sharp people working on spreadsheet performance for many years. I remember reading a blog post, I believe by Joel Spolsky, about what Excel does behind the scenes to achieve high performance, and I was pretty impressed.

One example that comes to mind was that spreadsheets are just memory-mapped files, and the layout of the file on disk is identical to the data structures in memory. This allows them to eschew translation to a data interchange format. So they got performance at the cost of interoperability, which is probably what's hampering OpenOffice & friends.
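A minimal sketch of that idea in C (with a hypothetical Cell layout of my own, not Excel's actual on-disk format): when the bytes on disk already match the in-memory struct layout, "loading" the file is just mmap(2) plus a cast, with no parsing step at all.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical fixed-layout cell record -- NOT Excel's real format. */
    typedef struct {
        uint32_t row;
        uint32_t col;
        double   value;
    } Cell;

    int main(void) {
        int fd = open("sheet.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Map the file: the "load" is just a cast; no translation layer. */
        Cell *cells = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (cells == MAP_FAILED) { perror("mmap"); return 1; }

        size_t n = st.st_size / sizeof(Cell);
        for (size_t i = 0; i < n; i++)
            printf("(%u,%u) = %g\n", cells[i].row, cells[i].col, cells[i].value);

        munmap(cells, st.st_size);
        close(fd);
        return 0;
    }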



That’s certainly history if you use a modern file format such as .xlsx, and likely also if you use the old format.

Microsoft likely changed several in-memory structures when Excel went 64-bit, if not earlier.

One thing that Excel does is multithreaded recalculation (https://docs.microsoft.com/en-us/office/client-developer/exc...)


> Microsoft Office Excel 2007 was the first version of Excel to use multithreaded recalculation (MTR) of worksheets. You can configure Excel to use up to 1024 concurrent threads when recalculating, regardless of the number of processors or processor cores on the computer.
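As a toy illustration of the concept (plain pthreads, and only over independent cells; real MTR has to partition the formula dependency graph into chains that can safely run concurrently, which this sketch skips):

    #include <pthread.h>
    #include <stdio.h>

    #define NCELLS   1000000
    #define NTHREADS 8

    static double cells[NCELLS];

    /* Each worker recalculates a disjoint, strided slice of cells.
     * Real MTR schedules work from the dependency graph instead. */
    static void *recalc(void *arg) {
        long id = (long)arg;
        for (long i = id; i < NCELLS; i += NTHREADS)
            cells[i] = (double)i * 1.5;   /* stand-in for a formula */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, recalc, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("cells[42] = %f\n", cells[42]);
        return 0;
    }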

Somewhere, there is probably someone running hundreds of threads for Excel (likely in a beefy VM/VDI). It is probably wired so deep into their business that they are afraid to move to other, more scalable methods. But such is the power of Excel: "what you see is what you get" is not to be underestimated.


IMHO there's no reason to memory-map the interchange format itself.

I would predict/expect that both LibreOffice and MS Office (with their modern XML-based formats) are actually mmap(2)ing some temp file and treating it as an "on-disk working-state heap", and then importing from interchange formats by allocating from that heap / exporting to interchange formats by chasing pointers that end up inside that heap. (This is, after all, what every RDBMS does for its working state. It's pretty optimal.)

Even if you have a memory-mapped interchange format, I'd still expect them to have a separate disk-backed working heap for all the stuff that doesn't belong in the file but is nevertheless very large (e.g. cached intermediary computation results of spreadsheet cells); and, if they have it, they may as well just use it for most things by default. Thus, I would expect that even in old versions of MS Office, the in-memory data structures were actually an interchange format of sorts: not the ones being updated with each keystroke, but rather ones that'd be memcpy(3)ed into on export. (This also prevents you from either having to add a page-table structure to your file, or else constantly "defragment" it as data structures change.)
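A toy sketch of that kind of disk-backed working heap (my own illustration, not either suite's actual implementation): mmap(2) a temp file and hand out offsets into it rather than raw pointers, so the working state can outgrow RAM and references survive the file being unmapped and remapped.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Toy disk-backed arena: allocations come out of a mmap(2)ed temp
     * file, and callers hold offsets rather than raw pointers. */
    typedef struct {
        uint8_t *base;
        size_t   used, cap;
    } Arena;

    static Arena arena_open(const char *path, size_t cap) {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        ftruncate(fd, (off_t)cap);            /* reserve file space    */
        void *p = mmap(NULL, cap, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }
        close(fd);                            /* mapping keeps it live */
        return (Arena){ .base = p, .used = 0, .cap = cap };
    }

    static size_t arena_alloc(Arena *a, size_t n) {  /* returns an offset */
        size_t off = a->used;
        a->used += n;
        return off;
    }

    int main(void) {
        Arena a = arena_open("/tmp/workheap.bin", 1 << 20);
        size_t off = arena_alloc(&a, sizeof(double));
        *(double *)(a.base + off) = 3.14;     /* cached intermediate   */
        printf("value at offset %zu: %f\n", off, *(double *)(a.base + off));
        munmap(a.base, a.cap);
        return 0;
    }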


Not sure why you were getting downvoted; these are pretty reasonable comments. The database example with B-trees you mentioned is a good one.

It’s definitely possible to create a performant and portable document specification, and others have.

I just strongly suspect that the performance issues LibreOffice and others have stem more from a lack of manpower, not having equivalent resources, and less knowledge of the Excel formats, rather than from some shortcoming in MS's own file formats.


> One example that comes to mind was that spreadsheets are just memory-mapped files, and the layout of the file on disk is identical to the data structures in memory. This allows them to eschew translation to a data interchange format. So they got performance at the cost of interoperability, which is probably what's hampering OpenOffice & friends.

That improves save/restore performance, but in and of itself doesn't do much about execution performance of macros.


It would help runtime as well, because the macro could operate on data not yet loaded into memory, with the OS paging it in on demand. Not having to marshal and unmarshal data can save a lot of execution time, especially in the face of non-contiguous reads/writes. Databases store information using B-trees, for example, so that they can calculate the offset of the data and jump directly to it. It would probably take a lot of gymnastics to get this from an XML or JSON interchange format.
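For example (using a toy fixed-width layout of my own, standing in for a real B-tree page): with fixed-size records you can compute an offset and pread(2) a single cell directly, whereas an XML or JSON file forces you to parse from the top to find anything.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical fixed-width record; databases get the same effect
     * with B-tree pages, but the principle is offset arithmetic. */
    typedef struct {
        uint32_t row, col;
        double   value;
    } Cell;

    /* Jump straight to record i: no parsing, one positioned read. */
    static int read_cell(int fd, size_t i, Cell *out) {
        off_t off = (off_t)(i * sizeof(Cell));
        return pread(fd, out, sizeof(Cell), off) == sizeof(Cell) ? 0 : -1;
    }

    int main(void) {
        int fd = open("sheet.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        Cell c;
        if (read_cell(fd, 12345, &c) == 0)   /* non-contiguous access */
            printf("(%u,%u) = %g\n", c.row, c.col, c.value);
        close(fd);
        return 0;
    }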

I'm sure they could come up with something that is both portable and performant, but it's probably not a big priority at Microsoft.



