Hacker News
Microsoft Open XML embarrassment: spaces go missing between words (itwriting.com)
45 points by bensummers on Feb 23, 2011 | 75 comments


"as I understand it a large part of the point of Open XML is to preserve fidelity in archived documents"

No, the point was to preserve a monopoly. The stuff about preserving fidelity was just a smokescreen since anyone paying attention knows that Word has a long history of changing formatting depending on which printer you have plugged into a machine. If you want that kind of archival then use the PDF/A standard.


The main purpose of Open XML is to facilitate processing and creating documents programmatically using standard tools to manipulate the DOM. Considering the vastness of resources which exist across platforms for manipulating XML (including those for translating to and from XML schemas), this is hardly a proprietary move on Microsoft's part. If they wanted proprietary, they would have copyrighted the format as Autodesk did with the Revit file format.


I don't know if you've actually examined the OOXML format, but I spent about 6 months dealing with it in great detail at my old job, specifically Excel. For simple things, conventional XML manipulation tools could potentially be useful, but there are some serious issues.

For one thing, the files Office generates and consumes are significantly different from the format as specified in the ECMA and ISO standards. The schemas for ECMA-376 will specify elements as occurring in any order, while Excel will require those elements to appear in a fixed order. Tons of tiny problems like this exist everywhere, and you only need one mistake to create a malformed document. The machine-readable XML schemas provided by Microsoft will not work without massaging.

The format also makes it very difficult to make simple in-place changes. A good example of this is what you might consider the most basic information in a spreadsheet- cells. Cell tags are contained in row tags which are contained in a sheetData tag. Rows contain their row number as an attribute, while cells contain their A1 "row/column" ID. This means if you remove or reposition a row in a spreadsheet you must update every cell in every subsequent row.
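
To make the renumbering cost concrete, here is a minimal Python sketch. It assumes a stripped-down, namespace-free fragment in which each row carries its number in an `r` attribute and each cell carries its A1-style reference in an `r` attribute; the sample data and the `delete_row` helper are hypothetical, written only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a worksheet's sheetData (assumption).
SHEET = """<sheetData>
  <row r="1"><c r="A1"/><c r="B1"/></row>
  <row r="2"><c r="A2"/></row>
  <row r="3"><c r="A3"/><c r="C3"/></row>
</sheetData>"""


def delete_row(sheet_data, row_number):
    """Remove one row, then shift every later row and its cells up by one."""
    for row in list(sheet_data):  # iterate over a copy because we remove from the parent
        n = int(row.get("r"))
        if n == row_number:
            sheet_data.remove(row)
        elif n > row_number:
            row.set("r", str(n - 1))
            for cell in row:
                # Keep the column letters of a reference such as "C3", replace the row digits.
                col = re.match(r"[A-Z]+", cell.get("r")).group(0)
                cell.set("r", f"{col}{n - 1}")


sheet = ET.fromstring(SHEET)
delete_row(sheet, 2)
row_numbers = [row.get("r") for row in sheet]
refs = [cell.get("r") for row in sheet for cell in row]
print(row_numbers, refs)
```

Even in this toy version, deleting one row touches every cell below it. A real workbook would presumably also need anything else that points at those cells (formulas, for instance) fixed up, which this sketch ignores.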

How about extracting a string value from a cell? The format makes it possible to store inline strings directly in a given cell, but Excel itself never uses this functionality. Instead, the cell contains an index into the SharedStringTable, which is stored as a separate XML document. If you delete or modify a cell, it might be referencing an SST entry that is no longer used. The only way to know is to search the document globally, remembering that dozens of different elements could potentially refer to a shared string. If your goal is to avoid bloating documents with junk, you have to solve this problem for a number of cases- style references, fonts and more.
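
A rough Python sketch of the orphaned-string check described above. It assumes simplified, namespace-free fragments in which a cell marked `t="s"` stores a shared string index in its `v` child; the sample data and the `unused_shared_strings` helper are hypothetical. It only inspects cells, so it deliberately sidesteps the harder part of the problem, namely every other element that may also point into the table.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-ins for the shared string part and one sheet (assumption).
SHARED_STRINGS = """<sst>
  <si><t>Name</t></si>
  <si><t>Total</t></si>
  <si><t>Orphaned</t></si>
</sst>"""

SHEET = """<sheetData>
  <row r="1">
    <c r="A1" t="s"><v>0</v></c>
    <c r="B1" t="s"><v>1</v></c>
    <c r="C1"><v>42</v></c>
  </row>
</sheetData>"""


def unused_shared_strings(sst, sheets):
    """Return shared strings that no cell in the given sheets refers to."""
    strings = [si.findtext("t") for si in sst.findall("si")]
    used = set()
    for sheet in sheets:
        for cell in sheet.iter("c"):
            # Only cells typed "s" hold an index into the table; C1 above is a plain number.
            if cell.get("t") == "s":
                used.add(int(cell.findtext("v")))
    return [text for index, text in enumerate(strings) if index not in used]


orphans = unused_shared_strings(ET.fromstring(SHARED_STRINGS), [ET.fromstring(SHEET)])
print(orphans)
```

Note that dropping an orphan is not free either: removing an entry shifts the index of every later string, so each cell that refers to one of those would need rewriting as well.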

If you want to modify OOXML documents in a robust, thorough manner, you'll deal with tons of issues like this.


I've also done some work dealing with the Excel file format for a personal project (.xls not .xlsx). I think it needs to be clarified (not that you're implying this), that at least a lot of this mess isn't deliberate obfuscation on Microsoft's part.

For instance, the SharedStringTable is something that made a lot of sense when documents had to fit on floppy disks. Excel is 26 years old, and when you evolve a file format for that long while trying to maintain backwards compatibility, you'll inevitably be stuck with a messy format.


I doubt that backwards compatibility is a priority at MS Office development


I've looked at OOXML on the Word side and I would agree that although it is human readable, the degree of cross-referencing makes understanding a typical document non-trivial. On the other hand, the technical threshold for processing the sort of issues you mention in XML is lower than doing so with binary data. Given the scale at which Microsoft operates that probably translates into higher productivity for their customers regarding programmatically manipulating or creating Office documents.

Although - all things being equal - unused entries and the potential for bloat are undesirable, they are rarely the primary goal for a project. Considering that compression is built into the file format, outside of outlying projects it probably does not count among the chief considerations of a typical project (even if it is aesthetically unappealing). And again, garbage collection within a binary format is an even more challenging task.


> The main purpose of Open XML is to facilitate processing and creating documents programmatically using standard tools to manipulate the DOM. Considering the vastness of resources which exist across platforms for manipulating XML (including those for translating to and from XML schemas), this is hardly a proprietary move on Microsoft's part.

This is incredibly naive.

Microsoft was starting to feel pressure because some government contracts require ISO standards where those exist -- and OASIS gained ISO standardisation, making it all of a sudden the ONLY document format for those contracts (and OpenOffice / KOffice the contenders).

Microsoft was repeatedly invited to participate in OASIS. They wouldn't.

Microsoft rushed it through ECMA, which acted as a rubber stamp, because of its fast-tracking agreement with ISO. Microsoft then stacked the local ISO chapters in its favour (with ridiculous results such as in Sweden) in order to get it approved.

And Microsoft, in fact, didn't actually implement the OOXML spec that they published. A spec that includes attributes like "doSpacingLikeWord95"...

You've either drunk the Microsoft Kool-Aid, are totally unaware of history, or are willfully naive. The only raison d'être of OOXML is to preserve Microsoft's monopoly.


Look at the effect, not the justification: just when OO XML was being talked about being standardised, and .doc was more or less interoperable, Open XML comes out. It's an open standard, but it requires effort to convert to it. Microsoft has money to burn so it can do the conversion; its competitors cannot.

Copyrighting the file format puts them in legal hot water: since it is a de facto standard, trying to prevent competition would likely violate anti-trust laws. This way they can appear to be open whilst being closed.


I wouldn't know how to write this bug...

It should serve as a testimony on how hard it is to implement Office Open XML. Not even Microsoft can get it right.


Note that nobody implements Office Open XML as in the ISO standard.

Microsoft's implementation is incompatible with what ISO specified, mainly because in the ISO standardization process, ISO dropped some of the byzantine stuff in OOXML. Microsoft never adjusted its implementation to that, so AFAIK today there is no implementation of the standard in existence.


Which is what is so frustrating about major corporations and their approach to the law. Or rather how the law is enforced.

Small deviations from speed limits are not particularly harmful, but small deviations from the standardisation process here or breaching insider information rules in banks make a mockery of the system. They can pass the infractions off as unavoidable incompetence, but really the system should not accept these specious half-truth excuses in these circumstances and should come down much harder, sooner and more often.


"the law"?


Governments often require "standards" in their purchases for very good reasons, like not requiring all citizens to purchase a proprietary solution to interact with them or getting lower prices thanks to competition. If vendors claim to be delivering standards, but aren't really then it's not much different from selling devices or services that don't meet the requirements. Obviously there is a line where incompetence becomes fraud.


It's easy to say that, but the practical realities are different.

This explains it better than what I can write:

http://www.joelonsoftware.com/items/2008/02/19.html


I like how Joel explains the necessary reasons why the file formats in question have to be so bad, while tacitly admitting that having a good reason for being bad doesn't make bad code any more useful.


The takeaway for me is that designing, developing and maintaining a full featured office suite is incredibly hard. And standardizing it is even harder. Look at all the teething problems that ODF had.

OO.org/LibreOffice doesn't even conform fully to the ODF standard. See issues like http://www.zdnetasia.com/ooxml-expert-odf-standard-is-broken...


This might be a funny quip, but I disagree. Implementing anything is hard, and weird bugs creep in. I've never seen any program without bugs. So it's pretty disingenuous to take one bug in Word and say that it shows there's something wrong with Open XML.


> Implementing anything is hard

Come on. Implementing something you invented should be easy. I would understand if the [Open|Libre]Office folks got it wrong, but Microsoft? The same company that basically discredited ISO (and badly damaged its function afterwards) in order to standardize this monstrosity? To not even bother to implement its botched bogus standard correctly is beyond insulting.

And yes, of all things wrong in MS Office Open XML, the bugs are the least important.


> Come on. Implementing something you invented should be easy.

That's absurd. So you're saying there's no Apple bugs in Quicktime or Cocoa? You're saying there's no bugs in Emacs that Stallman wrote? You're saying that there's no bugs in Mathematica written by Wolfram? You're saying that there's no bugs in Java produced by Sun? You're saying that Ken Thompson wrote no bugs in Unix? Stroustrup wrote no bugs in C++?

I've never seen a non-trivial program, standards-based or not, that is bug free, period. Not one.

Heck, there's a 30 year old bug in binary search that largely went unnoticed -- even Donald Knuth missed the bug!
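
The comment does not say which bug it means. Assuming it is the widely discussed midpoint overflow, where `(low + high) / 2` wraps around on fixed-width integers, here is a small Python sketch. Python integers do not overflow, so the hypothetical `to_int32` helper simulates a signed 32-bit wraparound; the index values are made up for illustration.

```python
def to_int32(x):
    """Wrap an integer to a signed 32-bit value, as a fixed-width int would."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


# Two valid indexes into a very large array (illustrative values).
low, high = 1_500_000_000, 2_000_000_000

# Naive midpoint: the sum exceeds the 32-bit range and wraps negative before the division.
naive_mid = to_int32(low + high) // 2

# Overflow-safe form: the difference always fits, so the result stays between low and high.
safe_mid = low + (high - low) // 2

print(naive_mid, safe_mid)
```

The naive form is only wrong for arrays with more than about a billion elements, which is a plausible reason a bug like this could sit unnoticed for decades.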

Bugs happen in trivial programs. Any non-trivial program will have bugs.

This is completely insincere. Unless you're willing to say the same thing about ODF and virtually every other file format that exists, since I can find bugs implementing just about all of them from their core proponent.


Get used to it. Regardless of the technical merits, it's cool to hate on MS and blindly support Apple/Google on Slashdot, Reddit and even more so on HN. I've seen people quit HN in disgust because of the arguments, comments and moderation of Apple fans on here.


There's a bug somewhere in Open/LibreOffice where an old ODT file opened in a new version loses spell checker support, and no tinkering with dictionaries will fix it. It's also carried through a copy and paste.

And it happens in Word 2010 when I try to open the same ODT file.

Bugs happen.


> Bugs happen.

But one could assume they could, at least, implement correctly something they invented.


This comment really makes me wonder if you have coded on a large project with a significant group of people. Building an iPhone app is nearly impossible for a single coder to get 100% right, where they have massive control of everything. Add the heterogeneous environment of Windows and dozens of programming groups trying to come together, and it is nonsensical to say "they could, at least, implement correctly something they invented."

Software is way harder than that.


I think Microsoft's view of software quality has contaminated the industry. If you can't build your own spec correctly, maybe that's because you got overambitious with it.

I find it ludicrous that Microsoft could write the spec, find the resources to corrupt the process at ISO, discredit a valuable institution, cripple it by inflating membership with members that don't participate on any other issues in order to promote a standard that aims to be impossible to implement by third parties and be unable to command the resources required to implement it correctly in the first place.

Didn't they have a reference implementation for the standard in the first place?!


Why hold Microsoft to standards (no pun intended) that no one else can meet? Documents produced by Office validate better against the OOXML transitional spec than documents produced by OpenOffice validate against the ODF spec.

If the ISO process had followed its normal course, the final OOXML spec would have been close to what went in, with a few fixes. Instead, IBM and a few others tried at every step to stop the process, and if they couldn't stop it, they pushed through significant changes to the spec. In effect they changed the process from standardizing (with some cleanup) an existing format into writing a new format.


So you mean Microsoft is to blame for Netscape 4.x sucking and crashing on every OS?

Office has code dating back to the 80s. Read more for a backstory http://www.joelonsoftware.com/items/2008/02/19.html


Who said anything about Netscape? No. It's not OK for Netscape to ship buggy software and it's certainly not OK for a 200+ billion dollar company to do so.

What's the excuse? They couldn't hire programmers to correctly implement their own spec years after it being published?


>Who said anything about Netscape?

You did, in a way. By saying this:

>>I think Microsoft's view of software quality has contaminated the industry.

>They couldn't hire programmers to correctly implement their own spec years after it being published?

Throwing bodies at something doesn't make it right in software engineering. Haven't you heard of bugs and issues in Google's or Apple's products? After all, Apple is a bigger company now.

I was going to link to some Apple bugs but Apple Discussions was down (cue 'why can't a 200+ billion company keep their forums up? Can't they hire more people to fix it?' )

Unable to add beyond 100 pages to document(maybe they need to hire more people to hit 200?) http://discussions.info.apple.com/thread.jspa?threadID=26879...

http://arstechnica.com/apple/news/2010/07/apple-looking-into...

http://www.youtube.com/watch?v=Pdk2cJpSXLg&feature=playe...


> in a way

You mean you interpreted it as me saying something about Netscape.

> Throwing bodies at something doesn't make it right in software engineering

Usually no, but Microsoft can throw a lot of bodies and rebuild from scratch, this time with good engineering.

> Apple is a bigger company now.

I don't think so. It's just more valuable.


Rebuilding something like Office will take a decade and will likely have no benefits at all.

Rewrites rarely make sense at all. Just see how old some of the code that's in widespread use is. Android is built on Linux, which started in 1991 (ignoring that Linux was based on the even older Minix and its Unix roots). OS X/iOS are based on BSD, Darwin and Unix, which are quite old.

See http://www.joelonsoftware.com/articles/fog0000000069.html

> I don't think so. It's just more valuable.

The point still stands.


> Rebuilding something like Office will take a decade and will likely have no benefits at all.

I think that allowing to launch better, faster, safer, leaner and stabler versions faster with less bugs and for less money, while, at the same time, uncovering and correcting bugs that have been in the codebase for decades would be a plus.

> ignoring that Linux was based on the even older Minix

That's good, because it was not. One could say it was inspired on Minix and Unix.

> OS X/iOS are based on BSD, Darwin and Unix

OSX is more closely related to NeXTStep

Your history lessons are failing you.


>I think that allowing to launch better, faster, safer, leaner and stabler versions faster with less bugs and for less money, while, at the same time, uncovering and correcting bugs that have been in the codebase for decades would be a plus.

Did you even read the link I provided?

Netscape killed itself in rewrite, MS almost did that with Vista before hitting reboot and starting over with old code and now you would want MS to undertake a super expensive rewrite? Things rarely work so ideally in the real world, especially for humongous feature/code bases like Office.


Yes, I did read your link. The relationship of NeXT to this discussion (about Microsoft's inability to correctly implement something they invented and that they want others to implement too - because it's a standard after all) escapes me.

Attributing Netscape's demise to a rewrite of the browser is a bit exaggerated. They were under enormous pressure with a company with more resources to spend monthly than their entire market cap and giving away a browser bundled with Windows. The pressure to deliver new versions made them cut corners and allowed the browser to accumulate an enormous quantity of kludges that culminated with the need to throw it out and restart from scratch.

If anything, they should have been rewriting from the start, never allowing the cruft to build up. It's an investment that pays back more often than not, especially if you are under pressure to deliver new features quickly.

BTW, your fixation with Netscape is interesting too. You brought it up and tried to reason I did. That's also something that escapes me.

Why would a full rewrite of Microsoft Office be so expensive? How much did Sun, Oracle and independent collaborators put in OpenOffice anyway? Microsoft has the resources for that. The reason they don't do it is because they don't need to.


That doesn't seem like a good thing to assume. Every invention has issues.


It's sad to see even otherwise knowledgeable people jump on bugs when it's Microsoft while Apple(eg. 3rd generation iPod Touch and iPhones getting superslow for months with iOS 4.0 update), and Google (all the crazy unfixed bugs in SMS like in the other article on the FP, contrast the comments there) get a free pass.

Managing the tens or hundreds of millions of lines of code dating back to the 80s is not easy.

http://www.joelonsoftware.com/items/2008/02/19.html

When it's MS, the comments and moderation are always about malice and incompetence, even on HN.

I remember someone on Reddit calling HN an Apple fanboy club. I guess they're not that far from the truth looking at the comments and moderation for the articles here.


Why are you implying that Apple and Google being fallible makes it OK to ship buggy software? It's not OK. It happens, regrettably, and has to be corrected.

> Managing the tens or hundreds of millions of lines of code dating back to the 80s is not easy.

It seems they should tackle an easier problem then. This one is, evidently, too hard for them.

> When it's MS, the comments and moderation are always about malice and incompetence

Incompetence, malice and... What would be the third explanation?

> I remember someone on Reddit calling HN an Apple fanboy club

It's been a long time since it's not.


It's next to impossible to ship without bugs for something the scale of Office. Your comments seemed to blame Microsoft, so I was giving examples of bugs in other products that have nothing to do with MS.

>Incompetence, malice and... What would be the third explanation?

Nothing really, but my point was about all the negativity in the comments and moderation when it's MS vs. many other companies. I don't like Microsoft but I don't think they deserve such a raw deal while other companies get a free pass for very similar issues.


"Come on. Implementing something you invented should be easy."

Sounds like someone who has never implemented something more complicated than a hello world.


Sounds like you don't know me.

In the past 25+ years I got my share of bugs. But having a correct implementation of a spec is kind of the only proof you can have the spec is complete and implementable.

Not having one is just sloppy.


Of course I don't know you, but with 25 years you should be in a position to recognize that a spec as complicated as this one is near impossible to implement 100% correctly. Even more, there is no way to tell if the spec is implemented correctly. Certainly when you actually have to ship something - you can't just spend 10 years polishing the implementation, testing millions of edge cases.


> a spec as complicated as this one is near impossible to implement 100% correctly.

I gather being impossible to implement by third parties was a design requirement that was accomplished. What I find surprising is they got carried away and made a spec they couldn't implement either.

> there is no way to tell if the spec is implemented correctly

That's why a correct open reference implementation is a must. You can always say that any corner cases can be resolved according to the code.

When you make your spec excessively complex you are just asking for trouble.

This spec, BTW, exists for the sole reason as to legitimate a Microsoft format as a standard competing against ODF. It exists not to be implemented correctly, but to fragment the marketplace, preventing the standardization of something Microsoft cannot control and use as leverage.


> This spec, BTW, exists for the sole reason as to legitimate a Microsoft format as a standard competing against ODF. It exists not to be implemented correctly, but to fragment the marketplace, preventing the standardization of something Microsoft cannot control and use as leverage.

ODF was never a feasible alternative. First, Sun retained veto power via the threat of patents over ODF. Sun's patent grant for standardization was limited to a particular version of the standard and any future versions whose standardization Sun participated in. If they wanted to derail attempts to take ODF in a direction that they did not approve of, they could withdraw from the committee leaving the standard unprotected from their patents.

Second, Sun made it clear that ODF was only going to have those features necessary to support Star Office. There were attempts to make it more general, so that it could be a universal format, but Sun squashed them.


Apart from the tin foil hat, I love the doublethink in this. So basically you're saying 'screw the spec, make an implementation and make that the spec'?


Make a spec and make a reference implementation. Is there something wrong about it? What do you propose? Make a standard nobody can follow?


Sitting around writing specs while the competition runs away with the market does not really help in many situations. Guess this was even more true in the 80s when Office was originally developed.

Choose between 'sloppy and shipped'(most software) vs. 'never ships because of targeting perfection' (GNU Hurd?).


Over-engineering increases the likelihood of bugs.

Simple and elegant = less bugs.

XML is the epitome of over engineering.


Of all the objectionable things about Microsoft's Office Open XML, the (mis-)use of XML is fairly far down the list.

Trying to create a bug-compatible version of .doc is hard, as the sterling efforts of OpenOffice and others have shown. An attempt to create one in XML, by the very people who were best placed to just document the original format is way beyond good or bad engineering.


They did publish documentation for the original formats (iirc around the same time they published the new XML formats), btw.


It was a few years later on, after some wrangling with the EU about their monopoly status.


Representing documents should actually be one of the uses XML is suited for, and was the original intent. Prime example: HTML


HTML predates XML by a couple of years. HTML is loosely based on SGML.


Where in the link supplied is there any evidence that this is a bug in the file-format? Before you jump on the bandwagon of cheap Microsoft-stabs, at least do some basic critical reading of the source you are relying on.


Do you have any other explanation?


Why is it so difficult to understand the difference between executable code and a 'standard' file format?


Noticed this the minute I installed the Office 2010 beta, but vice versa: the spaces from my Word 2007 document were gone. Made me go back to 2007: wasn't going to rewrite my 20 page report.


> Office 2010 beta

> <sad story about a bug>

> wasn't going to rewrite my 20 page report.

And the moral of this story: don't use software that is officially of "beta" quality for important work. This is not specific to MS Office.


You're absolutely right, but the best test there is is the real-life one. And I was prepared for this so could go back. No animals were hurt, no dogs were blamed for the homework, but it gave a great first impression of the "improvement".


Link to original CNET article: http://news.cnet.com/8301-1001_3-20034213-92.html

Link to original thread on Microsoft forum: http://social.answers.microsoft.com/Forums/en-US/wordshare/t...


That is a pretty obscure test case. And I like how the "severe" impact this had was that someone got a bad grade on a paper.


Sending a file in the default format of the world's most popular word processor to another party who opens it on a computer with different settings and a slightly older version of world's most popular word processor doesn't sound especially obscure to me.

I'm assuming most people that fail to get invited to job interviews because missing spaces make their resume/cv look careless to a prospective employer opening the file in Word 2007 probably aren't going to be aware of the reasons behind the decision...


Well, it's not like the test case can say "different settings." The devil is in the details.


This sounds like simply a bug in Word 2007 or Word 2010, not a problem with the document format.


[deleted]


Only post and account created: 11 minutes ago

Flagging away.


As far as I see this, he admits the problem at the end: Microsoft Word is a Word Processing program, not a publishing program.

You are not guaranteed to have your layout preserved when printing. This can be for a variety of reasons, but it could be simple stuff like having your file in A4-format and your printer only having papers of type "Letter" or something equally silly. In cases like these, Word is forced to reformat.

If you need or rely on a 100% accurate re-representation of your content, Word should not be your tool. Never. Use PDFs. Simple as that. If you are writing normal text, however, it will probably never be a real problem that you will even notice.

Now, my question to the author (should he check out HN): How on earth is this related to OOXML? Where is the smoking gun saying this is a bug in the file-format? I honestly don't see it, and I don't see anyone else here questioning this unbacked claim.

I honestly expected better from HN.


Check the article again. It's not about layout preserved when printing, it's about actual spaces being deleted in the editor and then kept deleted when you save the file.


Maybe I need to spell it for you and those downvoting the OP. The headline says 'Open XML' (implying a bug in the standard format) whereas the article is talking about Office 2007 and Office 2010. See the difference?


These look like implementation defects to me. I do not know if the problem is in Word 2010, or Word 2007, or both. It is specific to Open XML since if you use the .doc format the problem goes away. It is embarrassing since Open XML is meant to be the new and better format, and this is a good reason not to use it.


It's the implementation of the standard in Office 2007 and 2010


So a more appropriate title could be 'Microsoft's Office embarrassment'? Or not?


I think MS Office Open XML and MS Office are pretty much very related. Nobody would implement that standard if it weren't for Microsoft pushing it.

And the title implies that, apart from MS Office Open XML, Office is not embarrassing. One may disagree with that implication, but the title is more specific.


There is a big difference between a file format and the program that makes use of it(even though they are related and even if the same entity developed both) that you're failing to grasp in multiple comments.


If you read the title carefully, you may also interpret it as Microsoft's Open XML embarrassment. Since I am unaware of other Microsoft products that implement the standard, I see no problem with the title. It's embarrassing, it's embarrassing to Microsoft and it's about MS Office Open XML.


"If you need or rely on a 100% accurate re-representation of your content, Word should not be your tool. Never. Use PDFs."

Or a Mac.



Interesting to be downvoted even when HN requires 500 karma to get downvote rights. Sorry for stepping on the toes of Apple fan club members.



