AsmXml: a fast XML parser/decoder in pure x86 assembler (tibleiz.net)
132 points by gmac on Oct 8, 2010 | 46 comments



This was for fun (as he says.)

> Remember: if you really need speed, do not use XML.

He defines his own schema language, in XML (not XML Schema), which the parser needs. Nothing wrong with that. You could even transform (some subset of) XSDs to it using XSLT. (see "2. defining the schema" http://tibleiz.net/asm-xml/tutorial.html and "defining the schema" http://tibleiz.net/asm-xml/documentation.html)

He stores attributes in an array, to give O(1) access (... if you assume attribute order is significant, which it isn't according to the XML spec and most tools... even though humans find them more readable in an expected order. So, to find a specific attribute, you'll have to step through them.)

XML parsers are unbelievably inefficient... (but c'mon... it's XML... all other sins pall to insignificance before the big one, as fireflies at dawn.) He makes the nice point that once you have the XML parsed, you can just use those strings directly instead of copying them again. IBM Research came out with this idea a while back, of combining the two, but it doesn't seem to have gone anywhere. I'd guess they patented it like crazy, because, after all, they are IBM. http://portal.acm.org/citation.cfm?id=1135777.1135796 (that's just the abstract - sorry, couldn't find the PDF.)

But again, all this cleverness wouldn't matter except that it's fun (if you really need speed, do not use XML.). Oh... and except in "XML appliances": http://en.wikipedia.org/wiki/XML_appliance

DISCLAIMER: despite my tone, I am pro-XML. It's very useful. Some of its limitations facilitate some parsing ideas that escaped everyone else because of their cost in terms of performance and expressiveness. Just as some people are so obsessed with performance that they miss modular elegance, some people are analogously obsessed with expressiveness. There! I think I've insulted everybody.


Oh my, that IBM research probably went into their DataPower appliances >> http://www-01.ibm.com/software/integration/datapower/perform...

My biggest gripe with XML is namespaces followed by XSLT. Working with XML at that point just leaves me feeling disgusted.

edit: here's the full text >> http://www2006.org/programme/files/xhtml/5011/p5011-mendelso...


> My biggest gripe with XML is namespaces followed by XSLT.

XSLT I agree.

Namespaces I quite like, the 3 issues I have being:

* I can't write namespaced documents in Clark's notation. Clark's notation rocks.

* Some XML tools/libraries/whatever manage not to understand namespaces correctly.

* It's backwards-compatible with non-namespace-aware parsers. This is what makes XML namespaces broken and lets people avoid understanding them.

They're not even hard to get, you just have to understand a namespaced name is a shortcut for a pair of (namespace-uri, local-name), and that a namespace is scoped to the node it's declared on. And you're set, you're done with namespaces.
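A quick way to see that pairing in practice is Python's ElementTree, which parses qualified names straight into Clark's notation ({namespace-uri}local-name). The prefix is only a serialization alias (the URI here is just an example):

```python
import xml.etree.ElementTree as ET

# The "a" prefix is just a serialization alias; the parsed tree keeps only
# the (namespace-uri, local-name) pair, rendered in Clark's notation.
doc = '<a:root xmlns:a="http://example.com/ns"><a:child/></a:root>'
root = ET.fromstring(doc)
print(root.tag)     # {http://example.com/ns}root
print(root[0].tag)  # {http://example.com/ns}child
```

Rename the prefix from "a" to anything else and the parsed tags come out identical, which is the whole point.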

But instead, you have numbskulls who tell you ElementTree 1.2 is broken because it handles namespaces very well but doesn't let you configure a default namespace or customize namespace aliases on serialization (it just sets them as ``ns\d``).

Oh yeah, now that I think about it there is one completely broken thing about XML namespaces (as far as I'm concerned): an element with no specified namespace lives in the current default namespace (``xmlns=uri``), but an attribute with no specified namespace lives in the "null" namespace instead. That is stupid and annoying.
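You can watch that asymmetry happen with ElementTree (the URN here is just a placeholder):

```python
import xml.etree.ElementTree as ET

doc = '<r xmlns="urn:example"><e a="1"/></r>'
root = ET.fromstring(doc)
elem = root[0]
print(elem.tag)     # {urn:example}e -- the element inherits the default namespace
print(elem.attrib)  # {'a': '1'}     -- the attribute sits in the "null" namespace
```

Same document, same default namespace declaration, and yet the unprefixed element is qualified while the unprefixed attribute is not.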


Namespaces, as an abstract idea, are great. But I find XML's version confusing in practice. It's partly the several different ways of configuring defaults (four I think), and partly the syntax.

You'd think they'd be as easy as Java:

  set-and-forget defaults (e.g. java.util.*); or
  explicitly qualify everything (in import or in use).

In practice, collisions are rare.

Is it harder in XML because arbitrary XML documents are commonly nested? (which can't happen to Java source code)


> It's partly the several different ways of configuring defaults (four I think)

Which ones? I only know of using the xmlns attribute to set a default namespace, though as I mention above (in my formatting-broken-by-HN text) attributes playing by completely different namespace rules is... annoying, to say the least.

> Is it harder in XML because arbitrary XML documents are commonly nested?

That's about the only difference, and I don't think it matters much (nor do I think Java's namespaces are very good).

And XML namespaces are actually unique: unless you do very stupid things, you can't have namespace collisions, though you can have namespace-alias collisions (and as I noted above, using Clark's notation is a great way to understand how things actually work).


Actually, that IBM research went into a lot of products including System Z that can be configured to have several processors dedicated to XML parsing. The results are really impressive: http://www-03.ibm.com/systems/z/os/zos/features/xml/gxla1per...


They are missing the most important thing: AsmXml builds a specialized structure for a specific kind of XML. Regular parsers like Xerces-C build a generic, mutable DOM tree. That is obviously _way_ more work, with different lists for attributes and children, different node types etc.

Looking at the schemata they support, it doesn't look like they support all possible XML structures.

A better comparison might have been one of the XML binding tools vs AsmXml, that's much closer in functionality.

I'd also be careful wrt the XML-ness of this tool. Does it properly handle all Unicode obscurities? Does it handle all DTD obscurities? Is it robust against documents not matching the schema, malformed documents, or even malicious documents?


Those are exactly the questions I ask myself every time I hear about a new, super-fast XML parser. Is it a real XML parser[1] or can it only handle a "simple" subset of XML? I am surprised that more people don't seem to share the same skepticism as you and me. Probably because most people don't realize that implementing a conforming XML 1.0 parser is not a trivial task.

And you are right about the XML data binding tools. In most cases parsing the XML itself is not what takes the bulk of the time. It is validation (against DTD, XML Schema, or ad-hoc), data conversion (e.g., from the string "123" to the integer 123), and perhaps construction of some in-memory representation (with the memory allocations that involves) that take the bulk of the time. There are existing tools[2] that can perform the above tasks an order of magnitude faster than a general-purpose XML parser by generating tailor-made validation and data extraction code, as well as data structures, from XML Schema.
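A hand-written sketch of the kind of code such a binding generator might emit (the element, attribute names, and class are all made up for illustration; real generators target C or C++ and skip the generic DOM entirely):

```python
import xml.etree.ElementTree as ET

class Person:
    """Hypothetical generated binding for <person name="..." age="..."/>."""
    __slots__ = ("name", "age")

    @classmethod
    def parse(cls, text):
        el = ET.fromstring(text)
        if el.tag != "person":          # ad-hoc validation baked into the binding
            raise ValueError("expected <person>")
        obj = cls()
        obj.name = el.get("name")       # extraction...
        obj.age = int(el.get("age"))    # ...and "123" -> 123 conversion, in one pass
        return obj

p = Person.parse('<person name="Ada" age="36"/>')
print(p.name, p.age)  # Ada 36
```

The win comes from validation, conversion, and data-structure construction being specialized to one schema instead of going through a generic tree first.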

[1] http://www.codesynthesis.com/~boris/blog/2008/05/19/real-xml...

[2] http://www.codesynthesis.com/products/xsde/


A number of years ago, I wrote an XML/C binding tool atop the GNOME SAX parser. It's able to handle anything in a DTD and a lot of things from XML Schema. I stopped working on it once it could do everything I had a use for. http://xmel.sourceforge.net/ It does O(1) access to attributes as well, since it does binding.


> It's able to handle anything in a DTD and a lot of things from XML Schema.

Which would be great if those two weren't basically the worst schema languages available for XML.


The 10x speed difference is impressive, but they do list some caveats in the benchmark:

* AsmXml does not copy attributes and text unless necessary (when the value includes entity or char references), so, if you need a zero-terminated string, you must copy the value first.

* On the other hand, it does more work since it also decodes attributes (O(1) access time), saving a lot of time to the caller of the library.

I'd be interested to know whether it would still be that much faster than a C library that had the same tradeoffs e.g. no copying of attributes/text unless necessary.
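A minimal sketch of that zero-copy tradeoff (a toy scanner, not AsmXml's implementation): keep offsets into the original buffer and materialize a string only when someone actually needs one.

```python
doc = b'<img width="640" height="480"/>'

def attr_span(buf, name):
    """Return (start, end) offsets of an attribute value -- toy scanner, no copying."""
    key = name + b'="'
    start = buf.index(key) + len(key)
    end = buf.index(b'"', start)
    return start, end

s, e = attr_span(doc, b"width")
view = memoryview(doc)[s:e]  # still no copy
print(bytes(view))           # b'640' -- the copy happens only here, on demand
```

The cost the caveat describes is exactly the last line: a C API handing out such spans can't give you a zero-terminated string without that extra copy.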

I also wonder if it's worth it for existing C libraries to incorporate hand-optimised asm from this library (only when targeting x86 of course) if it provides much more benefit than the above tradeoffs.


The RapidXml library is pretty similar to AsmXml, except that it's written in C++. Browsing around the "Links" section on the AsmXml website, I came across a site that performs speed comparisons between the various parsers out there. RapidXml does appear to come pretty close to AsmXml in terms of speed.

Source: http://xmlbench.sourceforge.net/results/benchmark200910/inde...


Except that RapidXml is not a real XML parser. It is a parser that can only handle a subset of XML 1.0 language. See:

http://www.codesynthesis.com/~boris/blog/2008/05/19/real-xml...


Thank you for the RapidXml information.

From the RapidXml site: "It is an in-situ parser written in modern C++, with parsing speed approaching that of the strlen function executed on the same data."

bold statement :)


"* On the other hand, it does more work since it also decodes attributes (O(1) access time), saving a lot of time to the caller of the library"

This part I didn't understand. Does he mean transforming the document's encoding into the platform's native encoding? And how can access to a list of attributes take constant time? I mean, there needs to be some form of lookup from the name of an attribute to its value. And how do the two relate to each other?


The trick is that through their mapping, they have an array of "attributes" for every element, with a specific offset for each attribute with a certain name. During construction, they build that array according to the names, so they do the lookup once during construction.


But then how do you reference the attributes? I get the code to parse my xml document, then afterward I want to access a certain attribute (by name). The library would still need to look up the index of that attribute, at O(n) complexity, right?


no, quoting from grandparent: "a specific offset for each attribute with a certain name" -- e.g., for all <image> elements, 'width' is always at attributes[0] and 'height' is at attributes[1], presumably regardless of whether that particular image actually has width and height attributes.


Oh I see, thanks.


One of my cofounders committed this crime against nature: http://github.com/tow/fox

Yes, an XML parser in Fortran. Seriously, for data exchange between solid-state physics codes, writing this was the best option available.

Occasionally people ask us why we founded a startup. Seriously?!


For some reason, I'm impressed whenever anyone does anything clever in ASM, even if the performance benefits don't add up to much these days.


I know that at the least, NES, Super NES, and Sega Genesis games were nearly all programmed in assembler. I don't know much about the next generation of game systems. Whenever you want to feel impressed, go back and play some old school classics.


Modern games are written in huge swaths of C++, tiny bits of SIMD compiler intrinsics and individual pieces of very special-purpose assembly.

If you want to feel impressed, check this out http://benfry.com/distellamap/


This is brilliant. So brilliant I'm submitting it to the main page.


You're more likely to find Lua or sometimes Python in modern games than assembly. Scripting has become a big part of many games, and Lua has found its niche by virtue of being relatively lightweight and quite easy to interface with from C/C++.

I don't think I've seen any assembly that was actually enabled in the games I've worked on in the last five years, though I do know where there's a tiny bit in my current project that's #if'd for older environments. That's just stuff like atomic operations that every modern compiler provides built-ins for now.

The link is great, thanks!


That is superb.


PugXML (http://www.codeproject.com/KB/cpp/pugxml.aspx) uses a similar approach but is very portable.


I have used pugixml (based on PugXML) for very fast parsing of very simple XML and I can highly recommend it. The speed and (IMO) interface improvements are very large. http://code.google.com/p/pugixml/

In AsmXML's benchmark, ASM was 14x faster than Xerces (http://tibleiz.net/asm-xml/benchmark.html). In pugixml's benchmark, pugi was 11x faster than Xerces (http://zeux.mooo.com/pub/pugixml/table.html). Not bad for ASM vs C++. Both systems only support a limited subset of XML features.


Modern compilers can generate very good ASM (much better than most programmers can write by hand). I think a parser with a similar architecture written in C should be about as fast as one written in pure ASM.


From the FAQ:

Assembly is fun.

Do you really think I would have written a parser in assembly in my spare time otherwise?


Can you supply references to compilers producing faster code than assembler written by hand? I'm under the impression that the programmer can still do better, if she's willing to put in the effort.


It's worth it to go through the slides from this talk by Felix von Leitner on Source Code Optimization: http://www.linux-kongress.org/2009/slides/compiler_survey_fe...

If you're writing an entire application in assembly for fun, then, sure, have at it. But if you're writing an entire application in assembly because you want better performance, then I think you're misguided. Write it in C, turn all optimizations on, run it with a bunch of benchmarks, profile it. Look to see where the bottlenecks are. Optimize those, maybe in assembly, or maybe by just being smarter with your C code.


Well, compilers these days can do incredibly crazy tricks to squeeze out every bit of performance. That being said, if a human knows these tricks as well, he can probably apply them even better than a compiler.


The compiler does a million things to speed up your code, which are mostly lost when you write pure assembly. I wouldn't be surprised if a C program was noticeably faster than the same thing in assembly...


You obviously haven't single stepped through the generated assembly from any modern compiler. Sure, some of it is reasonable stuff. But the majority is utter garbage. Fire up softice and try it.

Any half decent assembly programmer would beat compiled code.


Citation needed. I actually did the above (single-stepped through the generated assembly from a few modern compilers) and I haven't noticed much bad code, given most of the default assumptions (i.e., code that stays reasonably compatible across different processors, exported functions keeping live entry points, etc.)

Maybe you looked at the code produced with the optimizations off?


I've been reverse engineering code for a living for 10 years or so, and writing assembly code for 20 or so.

I don't have the time to provide examples I'm afraid, so take it or leave it ;)


I don't buy the argument from authority, I really think you're trolling.

Apart from the bad FPU code in gcc 3 (or the generally relatively poor code in gcc 3 compared to more recent or more expensive compilers) I really don't know of examples of significant problems in the code generation of modern C compilers. Just don't give me examples that work only on the latest CPUs, or that are fast only on one CPU generation or on one manufacturer's CPUs.


Revealing quote from the bottom of the benchmarks page:

> I've performed other tests on different machines and observed that, on a Pentium 4, the difference between AsmXml and Expat is smaller (next generation processors seem to be more profitable to C programs). I also did tests with my Arch Linux but I couldn't build xerces-c so I don't publish results, but Expat was faster, may be because of 686 optimizations.

Says it all, really.


As far as I remember, Expat is an event-based parser (it emits events when the next attribute, tag, etc. is ready) but it seems that this parser creates some sort of DOM. DOM-based parsers are usually much slower.


...and more useful as well.


How good a job "most programmers" could do in general doesn't matter. What does is the programmer of the concrete software in question.


Why was this written? Seems like it would have an interesting story behind it.


now I want to see someone write a COBOL interpreter in Fortran


Using good (asm) to do evil (xml). Not cool :(


You think that programming a significant bit of code in assembly is a Good Thing? If you can possibly avoid it?

As for XML, it's excessively heavyweight and kind of awkward, but don't underestimate what a big improvement it was over the incompatible, non-human-readable binary formats that it was meant to replace.

No, this isn't using good to do evil. This is an elaborate, properly functioning, potentially useful joke. I'm awed by the combination of technical acumen and perversity, but I'm not going to actually use it.



