I wish this article had shown side-by-side examples. Back when I built document ...

_the_inflator · 2025-07-19T11:42:42 1752925362

By accident, I saw firsthand how a simple layout, such as a page with a few paragraphs, can make you wonder how formats like Markdown are even possible, because the number one word processor throws such a gigantic load of crude syntax at you for just a few paragraphs.

Respect to MS for keeping the lights on.

People need to understand that there is no MS format per se, but different standards from which you can choose. Years ago, when OpenDocument was fairly popular, MS was kind of hesitant to use an XML format. XML is a strict format, no matter the syntax.

And I bet that MS intended such a complicated format to prevent Open Source Projects from developing parsers and MS from losing market share this way. I bet there are considerations about such a strategy discussed at the time, buried in Archive.org.

On the other hand, MS neither wanted nor foresaw the XML chaos that would follow later on. XML is a format, and all it demands is being formally correct. It is like assembly language: a fixed instruction set with lots of freedom, and only the computer needs to "understand" the code. If it runs, ship it.

The Zen of whatever cannot be enforced. JavaScript was once the Web's assembly language. Everything was possible, but you had to do the grunt work and encapsulate every higher-level function in a module that consisted of hundreds of LoC, doing in hundreds of lines what a single instruction in Python could achieve.

Then came Babel and TypeScript, and by now I have lost track of all the changes and features of the language and its dialects. The same goes for PHP, Java, C++, and even Python. So many features were hyped, and you must learn this crap nevertheless, because it is valid code.

Humans cannot stand a steady state. The more you add to something, the more active and valuable it seems. I hate feature creep. Kudos to all the compiler devs, who deserve credit for keeping the lights on.

Someone · 2025-07-19T13:05:12 1752930312

> And I bet that MS intended such a complicated format to prevent Open Source Projects from developing parsers and MS from losing market share this way.

It wouldn’t surprise me at all if it simply was “the XML schema mostly follows how our implementation represents this kind of stuff”.

The source code of MS Word almost certainly has lots of now weird-looking design choices based on having to run in constrained memory. It also has dark corners for “we released a version that did this slightly differently, so we have to keep supporting it”.

ninkendo · 2025-07-19T20:35:07 1752957307

> It wouldn’t surprise me at all if it simply was “the XML schema mostly follows how our implementation represents this kind of stuff”

That’s exactly what it was. They originally had a binary representation (.doc) which was pretty much just a straight-up dump of their internal data structures to disk. When they felt forced to make an “open” “xml-based” format, they basically converted their binary serialization to XML without changing what it represented at all. It was basically malicious compliance.

dathinab · 2025-07-19T13:17:59 1752931079

as far as I understand, just parsing OOXML is by far not enough to get anywhere close to a reasonably correct understanding of the layout of the document, due to how it's "super flexible" in ways going "beyond the OOXML standard", i.e. you still have to reverse engineer tons of things.

(i.e. they worked around the "XML is a strict format" part ;) )

or at least it was that way way back then when OOXML was new and the whole scandal about MS "happening" to not correctly implement their own standard thing was still news (so like 10+ years ago)

dathinab · 2025-07-19T13:13:52 1752930832

I wonder how much of this is related to accidentally grown complexity (in their original closed format) and their WYSIWYG editor just doing dumb stuff where devs aren't sure how or why it ended up that way, but also don't want to touch it lest it breaks.

Which they then carried over into OOXML.

Just to be clear, MS has back then, and recently again, repeatedly shown very clearly that the whole embrace, extend, extinguish thing is the core of their actions for most things open or standardized(1). And what better way to "extinguish" open text standards than making one themselves which is built in a way guaranteed to not work well, i.e. fail, for anyone(/most) but first party MS products, and then using that to push the propaganda FUD that open text standards just can't be good.

So I'm very sure that them having an obscure, hyper complex OOXML "open standard" format, where implementing it in a standard-compliant way is far from sufficient for correctly displayed/interpreted documents, is a very intentional thing.

But if you already have a mess internally, it is a very good move to just use/expand on that, because it gives you an excuse for why things ended up how they are and saves implementation time.

----

(1): disclaimer: In between there were a few years where they acted quite friendly; specific devs at MS still love Open Source in an honest way; in some areas open source also has won; and in some places it's just a very bad time for "extend and extinguish", so it's not (yet) done; and sometimes it's done very slowly and creepingly. So yes, you will find good MS open source projects and contributions. But it's still pretty much everywhere, no matter in which direction you look, as long as you look closely enough.

dathinab · 2025-07-19T14:15:32 1752934532

honestly OOXML looks a lot like someone took a non-XML format and gave it an XML encoding

like XML is a markup language, so it _should_ interleave quite "naturally" and well for text formatting tasks (i.e. see the OpenDocument example or super simple "ancient style" HTML)
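To make the contrast concrete, here is a rough sketch (not taken from the article): the same sentence once as interleaved "ancient style" markup and once as WordprocessingML-style runs, both read with Python's standard `xml.etree.ElementTree`. The sample strings and the helper names `text_of_interleaved` and `runs_of_paragraph` are made up for illustration, and the fragment is a bare `w:p`, not a complete .docx part.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the one bound to the "w:" prefix)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Interleaved markup: the formatting element wraps the text it applies to.
html_like = "<p>Hello <b>bold</b> world</p>"

# Run-based markup: the paragraph is a flat list of runs,
# each run carrying its own properties (w:rPr) next to its text (w:t).
ooxml_like = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
  <w:r><w:t xml:space="preserve"> world</w:t></w:r>
</w:p>
""".strip()


def text_of_interleaved(xml: str) -> str:
    """Plain text of interleaved markup: just concatenate all text nodes."""
    return "".join(ET.fromstring(xml).itertext())


def runs_of_paragraph(xml: str) -> list[tuple[str, bool]]:
    """(text, is_bold) for each w:r that is a direct child of the w:p root."""
    paragraph = ET.fromstring(xml)
    runs = []
    for run in paragraph.findall(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))
        bold = run.find(f"{{{W}}}rPr/{{{W}}}b") is not None
        runs.append((text, bold))
    return runs


if __name__ == "__main__":
    print(text_of_interleaved(html_like))
    print(runs_of_paragraph(ooxml_like))
```

Both carry the same text, but in the run-based form the bold flag is a property attached to a run object rather than markup wrapped around the words.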

but OOXML looks more like someone force-serialized some live OOP object hierarchy with (potentially cyclic) references and tons of subclasses etc.

tl;dr: i.e. it looks a lot like a simplified form of how text editors internally represent formatted text

like w:r looks like a text section, you could say a r_ow of wide characters or words, w:p looks like a subclass of an implicit type which is basically a `Vec<w:r>`, w:pPr looks like a ".presentation" property of w:p, same for w:rPr, probably both being subtypes of some generic Presentation base class. w:t looks like a generic `.text: String` property. w:pStyle looks like a property of Presentation or its ParagraphPresentation sub-class, its `w:val` property makes it look like a shared reference which can be looked up by the key `"Para"`. w:b is just another subclass of Presentation you can use in any context etc.
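As a sketch of that guess (and only a guess: every class name here, `Run`, `Paragraph`, `RunPresentation`, `ParagraphPresentation`, is hypothetical and not Word's actual internals), this is roughly what such an object hierarchy and a naive "dump it as XML" step could look like in Python. `dump_paragraph` and the helper `q` are likewise invented for illustration.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# WordprocessingML main namespace (the one bound to the "w:" prefix)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag: str) -> str:
    """Qualified tag name in ElementTree's {namespace}tag notation."""
    return f"{{{W}}}{tag}"


@dataclass
class RunPresentation:  # hypothetical; would be dumped as w:rPr
    bold: bool = False  # dumped as w:b


@dataclass
class Run:  # hypothetical; would be dumped as w:r
    text: str  # dumped as w:t
    presentation: RunPresentation = field(default_factory=RunPresentation)


@dataclass
class ParagraphPresentation:  # hypothetical; would be dumped as w:pPr
    style: str | None = None  # key into a shared style table (w:pStyle w:val)


@dataclass
class Paragraph:  # hypothetical; basically a list of runs, dumped as w:p
    runs: list[Run] = field(default_factory=list)
    presentation: ParagraphPresentation = field(default_factory=ParagraphPresentation)


def dump_paragraph(p: Paragraph) -> ET.Element:
    """Serialize the object graph one-to-one, field by field."""
    el = ET.Element(q("p"))
    if p.presentation.style is not None:
        ppr = ET.SubElement(el, q("pPr"))
        ET.SubElement(ppr, q("pStyle"), {q("val"): p.presentation.style})
    for run in p.runs:
        r = ET.SubElement(el, q("r"))
        if run.presentation.bold:
            rpr = ET.SubElement(r, q("rPr"))
            ET.SubElement(rpr, q("b"))
        ET.SubElement(r, q("t")).text = run.text
    return el


if __name__ == "__main__":
    para = Paragraph(
        runs=[Run("Hello "), Run("bold", RunPresentation(bold=True))],
        presentation=ParagraphPresentation(style="Para"),
    )
    print(ET.tostring(dump_paragraph(para), encoding="unicode"))
```

The point of the sketch is that the XML element tree falls out of the class layout almost mechanically, which is what the markup looks like from the outside.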

which opens the question

"do they mostly just dump their internal app state"?

and did they make their format that over-complicated and "over" flexible so that they can just change their internal structure and still dump it?

which would also explain how they might have ended up with "accidentally" incorrectly implementing their own standard around 10 years ago during early OOXML times

and if so isn't that basically "proof" that OOXML isn't really an open format but just a "make pretend" of one?

xg15 · 2025-07-19T15:12:52 1752937972

I read somewhere that in the first versions of Office, the "documents" were literally just memory dumps.

So I guess they're going back to that old strategy...

Edit: Source might have been this: https://news.ycombinator.com/item?id=39402595 , so part of it might have been an urban myth.