He stores attributes in an array, to give O(1) access (... if you assume attribute order is significant, which it isn't according to the XML spec and most tools, even though humans find them more readable in an expected order. So, to find a specific attribute, you'll have to step through them.)
XML parsers are unbelievably inefficient... (but c'mon... it's XML... all other sins pale to insignificance before the big one, as fireflies at dawn.) He makes the nice point that once you have the XML parsed, you can just use those strings directly instead of copying them again.
IBM research came out with this idea a while back, of combining the two, but it doesn't seem to have gone anywhere. I'd guess they patented it like crazy, because, after all, they are IBM. http://portal.acm.org/citation.cfm?id=1135777.1135796 (that's just the abstract - sorry, couldn't find the pdf.)
But again, all this cleverness wouldn't matter except that it's fun (if you really need speed, do not use XML). Oh... and except in "XML appliances": http://en.wikipedia.org/wiki/XML_appliance
DISCLAIMER: despite my tone, I am pro-XML. It's very useful. Some of its limitations facilitate some parsing ideas that escaped everyone else because of their cost in terms of performance and expressiveness. Just as some people are so obsessed with performance that they miss modular elegance, some people are analogously obsessed with expressiveness. There! I think I've insulted everybody.
> My biggest gripe with XML is namespaces followed by XSLT.
XSLT I agree.
Namespaces I quite like, the 3 issues I have being:
* I can't write namespaced documents in Clark's notation. Clark's notation rocks.
* Some XML tools/libraries/whatever manage not to understand namespaces correctly.
* It's backwards-compatible with non-namespace-aware parsers. This is what makes XML namespaces broken and lets people avoid understanding them.
They're not even hard to get, you just have to understand a namespaced name is a shortcut for a pair of (namespace-uri, local-name), and that a namespace is scoped to the node it's declared on. And you're set, you're done with namespaces.
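To make that concrete: Python's ElementTree happens to use Clark's notation for tags internally, so you can watch a prefix dissolve into the (namespace-uri, local-name) pair (a small sketch):

```python
import xml.etree.ElementTree as ET

# A namespaced name is just the pair (namespace-uri, local-name).
# Clark's notation spells that pair as one token: {namespace-uri}local-name.
doc = ET.fromstring('<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"/>')
print(doc.tag)  # {http://www.w3.org/2005/Atom}feed
# The "atom" prefix is gone: it was only an alias for the URI,
# scoped to the element that declared it.
```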
But instead, you have numbskulls who tell you ElementTree 1.2 is broken because, while it handles namespaces very well, it doesn't let you configure a default namespace or customize namespace aliases on serialization (it just sets them as ``ns\d``).
Oh yeah, now that I think about it there is one completely broken thing about XML namespaces (as far as I'm concerned): an element with no specified namespace lives in the current default namespace (``xmlns=uri``), but an attribute with no specified namespace lives in the "null" namespace instead. That is stupid and annoying.
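You can see the asymmetry directly in ElementTree (a sketch with a made-up `urn:example` namespace):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root xmlns="urn:example"><child width="10"/></root>')
child = root[0]
print(child.tag)     # {urn:example}child -- element inherits the default namespace
print(child.attrib)  # {'width': '10'}    -- attribute stays in the "null" namespace
```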
Namespaces, as an abstract idea, are great. But I find XML's version confusing in practice. It's partly the several different ways of configuring defaults (four I think), and partly the syntax.
You'd think they'd be as easy as Java:
* set-and-forget defaults (e.g. java.util.*); or
* explicitly qualify everything (in the import or at the point of use).
In practice, collisions are rare.
Is it harder in XML because arbitrary XML documents are commonly nested? (which can't happen to Java source code)
> It's partly the several different ways of configuring defaults (four I think)
Which ones? I only know of using the xmlns attribute to set a default namespace, though as I mention above (in my formatting-broken-by-HN text) attributes playing by completely different namespace rules is... annoying, to say the least.
> Is it harder in XML because arbitrary XML documents are commonly nested?
That's about the only difference, and I don't think it matters much, in my opinion (nor do I think Java's namespaces are very good).
And XML namespaces are actually unique: unless you do very stupid things, you can't have namespace collisions, though you can have namespace-alias collisions. (And as I noted above, using Clark's notation is a great way to understand how things actually work.)
They are missing the most important thing: AsmXml builds a specialized structure for a specific kind of XML. Regular parsers like Xerces-C build a generic, mutable DOM tree. That is obviously _way_ more work, with different lists for attributes and children, different node types etc.
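A toy sketch of that difference (hypothetical classes, nothing from AsmXml itself): a generic DOM node must be ready for anything, while a schema-specialized node is just a fixed record.

```python
# Generic DOM: every node carries machinery for arbitrary XML.
class GenericNode:
    def __init__(self, tag):
        self.tag = tag
        self.attributes = {}   # any attribute may appear
        self.children = []     # any child element may appear, in any order

# Schema-specialized: the shape is known up front, so a node is a fixed
# record -- less allocation, less bookkeeping, direct field access.
class Image:
    __slots__ = ('width', 'height')  # exactly the fields the schema allows
    def __init__(self, width, height):
        self.width = width
        self.height = height

img = Image('640', '480')
print(img.width)  # 640
```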
Looking at the schemata they support, it doesn't look like they support all possible XML structures.
A better comparison might have been one of the XML binding tools vs AsmXml, that's much closer in functionality.
I'd also be careful wrt the XML-ness of this tool. Does it properly handle all Unicode obscurities? Does it handle all DTD obscurities? Is it robust against documents not matching the schema, malformed documents, or even malicious documents?
Those are exactly the questions I ask myself every time I hear about a new, super-fast XML parser. Is it a real XML parser[1] or can it only handle a "simple" subset of XML? I am surprised that more people don't seem to share the same skepticism as you and me. Probably because most people don't realize that implementing a conforming XML 1.0 parser is not a trivial task.
And you are right about the XML data binding tools. In most cases parsing XML itself is not what takes the bulk of the time. It is validation (against DTD, XML Schema, or ad-hoc), data conversion (e.g., from the string "123" to the integer 123) and perhaps construction of some in-memory representation (with the memory allocations that involves) that take the bulk of the time. There are existing tools[2] that can perform the above tasks an order of magnitude faster than a general-purpose XML parser by generating tailor-made validation and data extraction code, as well as data structures, from XML Schema.
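A hand-written stand-in for what such generated binding code does (the `Point`/`bind_point` names are made up for illustration): validate and convert into typed fields in one pass, instead of keeping everything as strings and re-scanning later.

```python
import xml.etree.ElementTree as ET

class Point:
    __slots__ = ('x', 'y')  # fixed, typed fields derived from the schema

def bind_point(xml_text):
    elem = ET.fromstring(xml_text)
    if elem.tag != 'point':          # ad-hoc validation against the schema
        raise ValueError('expected <point>')
    p = Point()
    p.x = int(elem.get('x'))         # "3" -> 3 at parse time, not later
    p.y = int(elem.get('y'))
    return p

p = bind_point('<point x="3" y="4"/>')
print(p.x + p.y)  # 7
```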
A number of years ago, I wrote a XML/C binding tool atop the gnome SAX parser. It's able to handle anything in a DTD and a lot of things from XML Schema. I stopped working on it as it could do everything I had use for.
http://xmel.sourceforge.net/
It does O(1) access to attributes as well, since it does binding.
The 10x speed difference is impressive, but they do list some caveats in the benchmark:
* AsmXml does not copy attributes and text unless necessary (when they include entity or char references), so, if you need a zero-terminated string, you must copy the value first.
* On the other hand, it does more work since it also decodes attributes (O(1) access time), saving a lot of time to the caller of the library.
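The no-copy caveat boils down to values being views into the original input buffer. A sketch of the idea, with `memoryview` standing in for the pointer/length pair a C parser would hand back:

```python
buf = b'<a name="value"/>'

# The parser records where the value sits in the buffer, not a copy of it.
start, length = buf.index(b'value'), len(b'value')
view = memoryview(buf)[start:start + length]  # no copy yet

# Copying (e.g. to get a standalone / zero-terminated string) happens
# only when the caller asks for it.
print(bytes(view))  # b'value'
```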
I'd be interested to know whether it would still be that much faster than a C library that had the same tradeoffs e.g. no copying of attributes/text unless necessary.
I also wonder if it's worth it for existing C libraries to incorporate hand-optimised asm from this library (only when targeting x86 of course) if it provides much more benefit than the above tradeoffs.
The RapidXml library is pretty similar to AsmXml, except that it's written in C++. Browsing around the "Links" section on the AsmXml website, I came across a site that performs speed comparisons between the various parsers out there. RapidXml does appear to come pretty close to AsmXml in terms of speed.
"* On the other hand, it does more work since it also decodes attributes (O(1) access time), saving a lot of time to the caller of the library"
This part I didn't understand. Does he mean transforming the document's encoding into the platform's native encoding? How can the access time of a list of attributes be in constant time? I mean, there needs to be some form of lookup from the name of the attribute to its value. And how do the two relate to each other?
The trick is that through their mapping, they have an array of "attributes" for every element, with a specific offset for each attribute with a certain name. During construction, they build that array according to the names, so they do the lookup once during construction.
But then how do you reference the attributes? I get the code to parse my xml document, then afterward I want to access a certain attribute (by name). The library would still need to look up the index of that attribute, at O(n) complexity, right?
no, quoting from grandparent: "a specific offset for each attribute with a certain name" -- e.g., for all <image> elements, 'width' is always at attributes[0] and 'height' is at attributes[1], presumably regardless of whether that particular image has width and height attributes.
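A sketch of that scheme (names and layout invented for illustration): the schema assigns each attribute name a fixed slot once, the name lookup happens during construction, and afterwards access is a plain array index.

```python
# Derived from the schema, once, up front: each attribute gets a fixed slot.
IMAGE_SLOTS = {'width': 0, 'height': 1}

def build_image_attrs(parsed_attrs):
    """Place parsed (name, value) pairs into their fixed slots."""
    slots = [None] * len(IMAGE_SLOTS)     # missing attributes stay None
    for name, value in parsed_attrs:
        slots[IMAGE_SLOTS[name]] = value  # the name lookup happens here, once
    return slots

attrs = build_image_attrs([('height', '480'), ('width', '640')])
# In generated code the index would be a compile-time constant, so this
# is O(1) array access with no name lookup at all.
print(attrs[0])  # 640
```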
I know that at the least, NES, Super NES, and Sega Genesis games were nearly all programmed in assembler. I don't know much about the next generation of game systems. Whenever you want to feel impressed, go back and play some old school classics.
You're more likely to find lua or sometimes python in modern games than assembly. Scripting has become a big part of many games, and lua has found its niche by virtue of being relatively light weight and quite easy to interface with from C/C++.
I don't think I've seen any assembly that was actually enabled in the games I've worked on in the last five years, though I do know where there's a tiny bit in my current project that's #if'd for older environments. That's just stuff like atomic operations that every modern compiler provides built-ins for now.
I have used pugixml (based on PugXML) for very fast parsing of very simple XML and I can highly recommend it. The speed and (IMO) interface improvements are very large. http://code.google.com/p/pugixml/
Modern compilers can generate very good ASM (much better than most programmers can write by hand). I think a parser with a similar architecture written in C should be about as fast as one written in pure ASM.
Can you supply references to compilers producing faster code than assembler written by hand? I'm under the impression that the programmer can still do better, if she's willing to put in the effort.
If you're writing an entire application in assembly for fun, then, sure, have at it. But if you're writing an entire application in assembly because you want better performance, then I think you're misguided. Write it in C, turn all optimizations on, run it with a bunch of benchmarks, profile it. Look to see where the bottlenecks are. Optimize those, maybe in assembly, or maybe by just being smarter with your C code.
Well, compilers these days can do incredibly crazy tricks to squeeze out every bit of performance. That being said, if a human knows these tricks as well, he can probably apply them even better than a compiler.
The compiler does a million things to speed up your code, which are mostly lost when you write pure assembly. I wouldn't be surprised if a C program was noticeably faster than the same thing in assembly...
You obviously haven't single-stepped through the generated assembly from any modern compiler. Sure, some of it is reasonable stuff. But the majority is utter garbage. Fire up SoftICE and try it.
Any half decent assembly programmer would beat compiled code.
Citation needed. I actually did the above (single-stepped through the generated assembly from a few modern compilers) and I haven't noticed much bad code, given most of the default assumptions (i.e. reasonably compatible code across different processors, live entry points for exportable functions, etc.)
Maybe you looked at the code produced with the optimizations off?
I don't buy the argument from authority, I really think you're trolling.
Apart from the bad FPU code in gcc 3 (or generally relatively poor code in gcc 3 compared to more recent or more expensive compilers), I really don't know of examples of significant problems in the code generation of modern C compilers. Just don't give me examples that work only on the latest CPUs, or that are fast only on one CPU generation or on the CPUs of only one manufacturer.
Revealing quote from the bottom of the benchmarks page:
> I've performed other tests on different machines and observed that, on a Pentium 4, the difference between AsmXml and Expat is smaller (next generation processors seem to be more profitable to C programs). I also did tests with my Arch Linux but I couldn't build xerces-c so I don't publish results, but Expat was faster, may be because of 686 optimizations.
As far as I remember, expat is an event-based parser (it emits events when the next attribute, tag, etc. is ready), but it seems this parser creates some sort of DOM. DOM-based parsers are usually much slower.
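You can see the event-based style with the expat bindings in Python's standard library: the parser just fires callbacks and builds no tree at all, and everything a DOM parser does is work added on top of this.

```python
import xml.parsers.expat

events = []
p = xml.parsers.expat.ParserCreate()
# Expat calls these back as it scans; no in-memory tree is built.
p.StartElementHandler = lambda name, attrs: events.append(('start', name, attrs))
p.EndElementHandler = lambda name: events.append(('end', name))
p.Parse('<doc><item id="1"/></doc>', True)
print(events)
# [('start', 'doc', {}), ('start', 'item', {'id': '1'}),
#  ('end', 'item'), ('end', 'doc')]
```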
You think that programming a significant bit of code in assembly is a Good Thing? If you can possibly avoid it?
As for XML, it's excessively heavyweight and kind of awkward, but don't underestimate what a big improvement it was over the incompatible, non-human-readable binary formats that it was meant to replace.
No, this isn't using good to go evil. This is an elaborate, properly functioning, potentially useful joke. I'm awed by the combination of technical acumen and perversity, but I'm not going to actually use it.
> Remember: if you really need speed, do not use XML.
He defines his own schema language, in XML (not XML Schema), which the parser needs. Nothing wrong with that. You could even transform (some subset of) XSD's to it using XSLT. (see "2. defining the schema" http://tibleiz.net/asm-xml/tutorial.html and "defining the schema" http://tibleiz.net/asm-xml/documentation.html)