By accident, I saw firsthand how a simple layout, such as a page and a few parag...

Someone · 2025-07-19T13:05:12 1752930312

> And I bet that MS intended such a complicated format to prevent Open Source Projects from developing parsers and MS from losing market share this way.

It wouldn’t surprise me at all if it simply was “the XML schema mostly follows how our implementation represents this kind of stuff”.

The source code of MS Word almost certainly has lots of now weird-looking design choices based on having to run in constrained memory. It also has dark corners for “we released a version that did this slightly different, so we have to keep supporting it”

ninkendo · 2025-07-19T20:35:07 1752957307

> It wouldn’t surprise me at all if it simply was “the XML schema mostly follows how our implementation represents this kind of stuff”

That’s exactly what it was. They originally had a binary representation (.doc) which was pretty much just a straight-up dump of their internal data structures to disk. When they felt forced to make an “open” “xml-based” format, they basically converted their binary serialization to XML without changing what it represented at all. It was basically malicious compliance.

dathinab · 2025-07-19T13:17:59 1752931079

as far as I understand just parsing OOXML is by far not enough to get anywhere close to having a reasonable correct understanding of the layout of the document due to how it's "supper flexible" in ways going "beyond the OOXML standard", i.e. you still have to reverse engineer tone of things.

(i.e. they worked around the "XML is a strict format" part ;) )

or at least it was that way way back then when OOXML was new and the whole scandal about MS "happening" to not correctly implement their own standard thing was still news (so like 10+ years ago)