
I’m going to just argue the exact opposite of the article: xml and json are both structured data formats useful for tree like data graphs, such as objects.

Whether that was the intended purpose when xml was designed is irrelevant. It’s what xml is used for in almost every case.

The author also doesn’t suggest what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s. Json really hasn’t been an alternative until quite recently (10 years ago?).

In fact reading the article carefully I fail to see the author argue why xml shouldn’t be used as a data format either.



The intended purpose is relevant because it tells us the conditions under which something is likely to work well.

XML was vastly overused for a long time. That doesn't make those usages correct, as there were alternatives even then. (It also doesn't make people who overused somehow bad; I think it was a reasonable and necessary mistake.) It certainly doesn't make new ones correct now given that JSON's been around 16+ years. [1]

I think the author here is slightly strong in his criticism; I think XML is great for things that are meant to be long-lived and self-documenting. That is, things that are used like documents. But if I'm passing short-lived globs of structured data back and forth, as with an API, I think JSON's a much better fit, as is Protobufs for more tightly joined code.

[1] https://web.archive.org/web/20030228034147/http://www.crockf...


> The intended purpose is relevant because it tells us the conditions under which something is likely to work well.

Not really. Plenty of things suck at their original intended purpose and remain in use because they are very good for some other purpose. (Viagra is an example well-known to popular culture, but hardly unique.)

> XML was vastly overused for a long time. That doesn't make those usages correct, as there were alternatives even then.

XML is perhaps not abstractly ideal for many of the purposes it has been used for, but in many cases it was superior to the alternatives for practical reasons, particularly the tooling ecosystem. (JSON is the new XML, and virtually the same thing can be said for JSON in many of its current uses, though it does clean up XML's two biggest warts: the element/attribute distinction and verbosity. Even for human-readable formats, YAML reduces verbosity further than JSON while being easier to read, not to mention all the binary options when readability isn't a concern.)


Yes, really. I agree there are exceptions, which is why I said "likely". But by and large, fitness for purpose correlates with design intent (which generally involves a significant period of iterative use, further driving fitness for purpose).


10 years ago the idea of having json anywhere but in the browser was crazy in the enterprise sphere


Sure, but the enterprise sphere is mostly what Moore called late majority and laggards. The biggest driver of use is not effectiveness, but perceived safety. A good proof of that is your choice of word here: crazy. It's not that JSON would have been somehow technically wrong; it was just socially wrong.


I'd argue that for data storage purposes, you'd like to have a low metadata to information ratio. In the examples the author gives this seems to be the main problem, with way more characters being used for markup than for content.

Compare that to JSON or TOML, which are more human-friendly and waste fewer bytes on structure to convey the same information. When used for data storage, two XML files of the same schema describing two completely different objects are likely to share a large amount of content, which is wasteful and gets in the way.


For storage of structured data (and probably even for loosely coupled RPC) you want a format that is efficient and schema-oblivious. The bad choice 15 years ago was XML; the bad choice today is JSON (the parsing overhead is not negligible) or Protobufs (not schema-oblivious). Various binary formats with a JSON-like object model seem like the way to go (my choice is CBOR).
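To make the compactness point concrete, here's a sketch comparing a hand-encoded CBOR map against the equivalent minified JSON. The CBOR bytes are written out manually from the wire-format rules, so no CBOR library is needed; the data itself is a made-up example:

```python
import json

# CBOR for {"a": 1, "b": [2, 3]}, encoded by hand:
# 0xA2       map with 2 pairs
# 0x61 0x61  text string "a"      0x01  unsigned int 1
# 0x61 0x62  text string "b"      0x82  array of 2: 0x02, 0x03
cbor_bytes = bytes([0xA2, 0x61, 0x61, 0x01, 0x61, 0x62, 0x82, 0x02, 0x03])

json_bytes = json.dumps({"a": 1, "b": [2, 3]}, separators=(",", ":")).encode()

print(len(cbor_bytes), len(json_bytes))  # 9 vs 17 -- and no text parsing on read
```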

And then there is the EU-wide absurdity of WhateverAdES, which invariably leads to onion-like layers of XML in ASN.1 encoded as base64 in XML wrapped in CMS DER encoded message...


I beg to differ. For a start, XML compresses well and besides, storage is monster cheap these days. XML is a better storage format because it documents what the data is (a title, a reference etc) as well as the data itself.

There are many better reasons to hate XML.


XML does compress well as text or over the wire, but the parse trees can be quite large in memory and processing consumption. At least in Perl I've had enough scripts crash out due to this overhead when implementing the common/naive solution using off-the-shelf modules. You can get around this by choosing between DOM or SAX, but I consider that a symptom of the problem: you choose XML to solve a problem and now you have another problem to solve.
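The DOM-vs-SAX tradeoff described here can be sketched with Python's stdlib: `iterparse` gives the streaming behaviour, and clearing each element keeps memory flat instead of holding the whole tree. The 1000-record document is synthetic:

```python
import io
import xml.etree.ElementTree as ET

# A synthetic log of 1000 records; imagine one too big to hold as a DOM.
data = io.BytesIO(b"<log>" + b"<record id='1'/>" * 1000 + b"</log>")

count = 0
for _event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "record":
        count += 1
        elem.clear()  # drop the parsed subtree so memory use stays flat, SAX-style
print(count)  # 1000
```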


I had the same problem with npm, I think, and JSON, because npm could not simply load the huge JSON file into memory. A huge anything can crash a naively written tool used to handle smaller instances.


That's true but I think XML has the edge there. It has so many features like defining new types which you wouldn't normally see in JSON. One parser we used had a ten to one ratio - 50MB of XML meant 500MB of RAM usage when using a DOM parser. And that's taking into account the textual representation of XML is already >50% bloat with the closing tags etc.


Well, XML is a markup language (and is really good at being that) while JSON is not. Sure, XML can be used as a poor man's data storage, as a base for a DSL, etc., but almost always there are better choices.


What are the better choices? And what were the better choices on the major platforms 10 years ago, the choice of which would have not seen every app use xml config files/dls/storage now?

I use csv when applicable. I use protobufs when applicable. But for the typical use case I choose xml when it's some config/dsl/dataset that needs to be human-editable (support comments, for example), has more complex structure than csv supports, and preferably doesn't need an external library or a custom parser. Json, Csv, Toml, S-Expressions, and protobufs all fail one or more of these requirements. I'm sure there are others, but none without at least one drawback I don't want.

A poor man's data storage is exactly what I want!


> preferably not need an external library or a custom parser. Json, Csv, Toml, S-Expressions, protobufs all fail one of more of these requirements.

And XML doesn't? Quite a few (not all, but quite a few nonetheless) programming languages include zero support for reading or writing XML-formatted data without using an external library or custom parser. This includes nearly all languages that predate XML, and quite a few languages that postdate it. Even when a language does have built-in (or at least in the standard library) support for XML, it's almost always a royal pain to use, especially once namespaces and schemas are involved.

Once upon a time, though, the answer was (and in a lot of places still is) INI:

- It's human-editable and supports comments

- It supports more complex structure than CSV

- Some languages have built-in support for it, and the Windows and GLib APIs support it, too (well, something similar enough to be compatible, in the latter case)

INI falls flat when you need to express deeper levels of nesting than keys and sections, though.
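For reference, here's what built-in INI support looks like in Python's stdlib configparser (the section and key names are made up):

```python
import configparser
import io

cfg = configparser.ConfigParser()
cfg.read_file(io.StringIO("""\
; comments are supported
[database]
host = localhost
port = 5432
"""))

print(cfg["database"]["port"])  # '5432' -- every value is a string,
                                # and there's no nesting below sections
```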

There's also YAML, which meets all your criteria about as well as XML does (at least on average; your specific language/platform might favor one or the other).


Right. Xml is to .NET what ini was for MFC. It’s what the platform “makes” you use. The same is true for json on js of course.

On a platform that has almost no support out of the box (e.g python) the choice is open. But on a platform that has a couple of formats built in, picking a format outside that platform is a pretty big step. The return needs to be substantial for a .net developer to use yaml via an external library over xml.

My reasoning in this thread has always started from the perspective that xml comes built in and almost no other format does. This is the case for e.g java and .net but not for python or C for example. But the prevalence of xml comes from java/.net so if we are to ask why, then we should consider that.


> There's also YAML

Omg YAML. Here's an example for you.

      - external:
        metric:
          name: kafka_consumer_group_lag
          selector:
            matchLabels:
              topic: rtb_trx_records
              consumer_group: trx-record-validator
        target:
          type: Value
          value: 30000
      type: External
It seems like the "type: External" should line up with "metric" and "target" but no, it needs to line up with the word "external" - not the dash, but the word after a space after the dash. Using YAML frequently reminds me of the quote "Be open minded, but not so open minded that your brains fall out".


I'm kinda surprised that's even valid YAML. I was under the impression that arrays and dicts can't mix like that.


Things like these should just be code in my opinion.


and having schemas is also great for some cases


Schemas are awesome for so many cases. Even though JSON Schema is a giant evolving clusterfuck, I still use it to be able to enforce some consistency.

And, being honest, JSON schema is better than, say, GPB or Avro schema at enforcing field relationships, e.g., "if typeId is 7, then partnerId cannot be null"
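As a sketch of the field relationship mentioned above, expressed with JSON Schema's if/then keywords (draft-07 and later; typeId and partnerId are the hypothetical fields from the comment):

```json
{
  "type": "object",
  "properties": {
    "typeId":    { "type": "integer" },
    "partnerId": { "type": ["string", "null"] }
  },
  "if":   { "properties": { "typeId": { "const": 7 } }, "required": ["typeId"] },
  "then": { "properties": { "partnerId": { "type": "string" } }, "required": ["partnerId"] }
}
```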


You aren't responding to the comment here, you are just reasserting the article's position. I'd argue there is not another format that is obviously better for every data storage or exchange use case, or that surpasses all of XML's benefits while minimizing all of its downsides. I don't want to look at XML, but I do understand why it is being used.


Abuse of XML killed it as a format. JSON is absolutely shit for semantic markup, and yet developers today routinely use it for documents because "XML is bad". They contrive ridiculous schemes for adding metadata and type information. They use it to generate HTML even when HTML takes less space. Finally, we regressed from XHTML to HTML5. Bye-bye namespaces and parsing consistency.


> They use it to generate HTML even when HTML takes less space.

The fact that MobileDoc exists makes me physically ill. Something that can be expressed with one line containing a paragraph element and an italic tag is over a dozen lines of JSON spam.


SGML being replaced by XML being replaced by JSON is great proof that the idea of progress in tech is at best a myth and at worst a lie.


I mean, the idea of progress in general is barely agreeable outside of recent strides in the physical sciences


Right, but, if given a choice of what to use, between XML and JSON, I'll pick JSON every time.

XML is a complete mess. Have you SEEN its spec?

You can put JSON's spec on a single page. XML's spec, not so much. Hell, most XML parsers don't support the full spec, and the ones which do have historically been riddled with security holes.

JSON over XML was simplicity over a crazy spec built by a bunch of companies all wanting to shove their own craziness into it.



The XML spec is longer than one page, granted, but it's about three times shorter than the YAML spec. And the XML spec describes not only the XML syntax but also a basic form of validation (DTD), which includes references. Basic XML has only five special symbols (<, >, ', ", and &) and can be parsed in linear time. (Namespaces complicate things somewhat.)


That's because JSON's spec isn't complete. It is predicated on the language interpreting it being able to just eval the structure and work with the data. [1]

1. http://seriot.ch/parsing_json.php
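A couple of the underspecified corners from that article, as seen from Python's stdlib parser (other parsers legitimately disagree on both):

```python
import json

# Duplicate keys: the JSON spec doesn't say what happens;
# Python silently keeps the last one.
print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2}

# Numbers: no required precision or range; Python overflows to infinity here,
# while other parsers raise an error instead.
print(json.loads('1e400'))  # inf
```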


For a non human-edited data storage or exchange that’s fine. Json is worse for human editable data though. Xml might not be the best alternative there but it beats json for things like small configs.

It’s not as simple as saying “everywhere xml is used, json would be a better choice”.


Your data storage format shouldn't be human readable. The data transferred over the wire shouldn't be human readable. It should be a binary serialized data format, probably encrypted, definitely compressed.

Yes, storage is cheap. Bandwidth is not. Also, you really don't want a human that intercepts your data to be able to read it. Additionally, your data structure only makes sense in the context of your domain, which has usually been modeled in your program(s) that work in that domain, and thus it is better to deserialize it within tools that understand that domain.

If you feel the need for a general purpose deserialization protocol, there are several available - Avro/Protobuf, etc.

Binary encoded data can often be decoded without consuming the entire document. SAX-parser-like reading requires at least reading an opening and a closing tag before the data is useful.

String serialization is a wasteful endeavor. It makes life easy for devs because it takes one less step to read the data in a text editor or log message, but quite often requires hacks to model things like recursive or self-referential data structures, and wastes space by repeating property names constantly for every item within the serialized structure. It's predicated on four of the fallacies of distributed computing, namely - bandwidth is infinite, the network is secure, transport cost is zero, and latency is zero. It is a solution looking for a problem, and because we are lazy, we don't build tools that would make binary serialized formats just as easy to use as json/yaml/xml.


Markup is a mix of scalar and structured data (it's a structure discovered or associated with a scalar) and thus it contains everything it needs to express structured data alone: just remove the scalar. E.g.

  <invoice id="123" customer-id="456" date="2019-10-30">
    <item no="1" product-id="789" qty="42" />
  </invoice>
Is this really a poor man's choice? And compared to JSON?! I can see at least the following advantages here:

1. Each element has an explicit type name (invoice, item). JSON is "typeless", which simply means the type information travels out of band. And with XML namespaces these type names can be made globally unique while still staying human readable.

2. Each element is self-contained, the code that produces the <item> doesn't need to know if there was an item before or after it so that it should add a separator. (The dangling comma problem in JSON.)

3. The attribute names are not just arbitrary strings as in JSON, there are strict rules of what can be in the name. They're much more suited for structured data than JSON, where you can name an attribute "foo.bar" and some JSON readers that accept a JSON "path" won't be able to find it.

4. It has less visual noise than JSON because the attributes don't have quotes around them and you don't need to separate elements with a special symbol. Despite the common belief, well-written XML is more readable than JSON.

And we haven't even touched things like validation + extended types, references, and transformation of data.
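The invoice example above can be read with nothing but Python's stdlib, which illustrates point 1 (the element name carries the type, in-band) and the self-contained attributes:

```python
import xml.etree.ElementTree as ET

invoice = ET.fromstring(
    '<invoice id="123" customer-id="456" date="2019-10-30">'
    '  <item no="1" product-id="789" qty="42" />'
    '</invoice>'
)

print(invoice.tag)                      # 'invoice' -- the type name travels with the data
print(invoice.get("customer-id"))       # '456'
print(invoice.find("item").get("qty"))  # '42'
```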


Yet every time I've stumbled upon XML in the past decade or so, it's been used as a data format because it's easy to manage and supported by every platform/tool out there. But sure, let's switch over to JSON or use a SQL database because we can't deal with the fact that XML might be better suited for something that it wasn't originally designed for.


It doesn’t answer the question, but I do wonder if XML would be an improvement in devops, compared to the current obsession with YAML. For everything except the part where you write it.

Make an xml stylesheet and your kubernetes cluster is instantly documented.


I recently entered DevOps (not my choice), and I would like to take your request further: replace everything with a regular programming language.

Having to use a gazillion declarative languages to achieve what a regular programming language does is simply crazy.


Here's hoping the next thing will be code in the same language you use for the apps themselves.

It’s testable, discoverable etc.

https://www.pulumi.com/

(Sorry about shameless plug, I’m not affiliated)


XML is pretty horrible to edit by hand.


Maybe if you are using Notepad. Any decent text editor will provide things like auto indentation, completion, auto end-tags, structured editing, and schema validation. For example, Emacs comes with nXML mode:

https://www.gnu.org/software/emacs/manual/html_mono/nxml-mod...


That assumes that you edit XML all day long. This is not always the case.

I am writing non-XML code most of the day, and I do not have structured editing / auto braces enabled. So when I need to edit that one XML config, I'll open it in my regular editor, which will provide at most syntax highlight, and edit it as needed with a bit of swearing. And next time, I would promise myself I'd choose a different config format which does not need special editors.


> I'll open it in my regular editor, which will provide at most syntax highlight, and edit it as needed with a bit of swearing.

That sounds like a very passive-aggressive way to deal with a problem. Do you do the same thing when writing programs?


Which "same thing"? Not setting up editor and complex environment for the things I am only going to edit once or twice? Yes.

In general, when you see something inefficient, you can either fix it to make it better, or ignore and come up with random workarounds.

In my opinion, a config file which cannot be edited by hand, and which needs a special editor with a non-trivial learning curve, is an inefficiency. I can either ignore it and set up the specialized tools; or I can fix it, by ripping out XML and replacing it with something more human-editable, like TOML or YAML. In large teams, it is almost always better to fix it -- sure, I will spend a few hours getting rid of XML, but this will pay for itself in the long term, as no one else will have to bother with the special setup anymore.

(This obviously only applies to systems where XML is a minor part, like a single configuration file. If your system has a huge amount of XML, you'd better learn the right tools.)


I don't understand, what is there to set up? With the Emacs mode I gave as an example, you just open an XML file and everything is there. Any decent text editor will have XML support.


xmllint --noout <file> will check the file and report any issues with XML syntax in a very detailed way with line numbers for you to see.

I myself don't even use syntax highlighting and normally work in vim and although I do make errors in XML sometimes, I find that I make at least as many syntactic errors in Python or C code that I have to weed out before I can proceed. But I never heard anyone complaining about Python or C being too strict :)
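The same well-formedness check xmllint performs can be sketched with Python's stdlib parser, which also reports line/column positions (the helper name is made up):

```python
import xml.etree.ElementTree as ET

def well_formed(text):
    """Return (True, None) for well-formed XML, else (False, error message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as exc:
        return False, str(exc)  # e.g. "mismatched tag: line 1, column ..."

print(well_formed("<a><b/></a>"))  # (True, None)
print(well_formed("<a><b></a>"))   # (False, 'mismatched tag: ...')
```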


They all are, but xml isn’t the worst. (Json and S-expressions are worse, for example).

Not even formats designed for human consumption such as yaml are very good. The good ones for editing (toml, csv, ini) fall short when it comes to complex structure instead. There is no silver bullet.


> The author also doesn’t suggest what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s.

That’s easy: S-expressions. Steve Yegge wrote about this in 2005: https://sites.google.com/site/steveyegge2/the-emacs-problem

Would you rather have:

    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!DOCTYPE log SYSTEM "logger.dtd">
    <log>
    <record>
      <date>2005-02-21T18:57:39</date>
      <millis>1109041059800</millis>
      <sequence>1</sequence>
      <logger></logger>
      <level>SEVERE</level>
      <class>java.util.logging.LogManager$RootLogger</class>
      <method>log</method>
      <thread>10</thread>
      <message>A very very bad thing has happened!</message>
      <exception>
        <message>java.lang.Exception</message>
        <frame>
          <class>logtest</class>
          <method>main</method>
          <line>30</line>
        </frame>
      </exception>
    </record>
    </log>
or:

    (log
     '(record
       (date "2005-02-21T18:57:39")
       (millis 1109041059800)
       (sequence 1)
       (logger nil)
       (level 'SEVERE)
       (class "java.util.logging.LogManager$RootLogger")
       (method 'log)
       (thread 10)
       (message "A very very bad thing has happened!")
       (exception
        (message "java.lang.Exception")
        (frame
         (class "logtest")
         (method 'main)
         (line 30)))))


I have no strong preference for either, so long as both of them convey the structured information I need.

Although tbh I'm preferring the XML in your example due to the lack of random quotes - why does (record need a quote, but (log doesn't?

The XML seems a lot more consistent.


The example seems pretty weird. If it's just data there shouldn't be any quotes at all. The only purpose of quoting is to prevent evaluation, so I guess the idea must be that `log` is a function call to be evaluated, but then the example isn't even the same thing as the XML version (and there are still nested quotes inside the already quoted record).


You'll want XML as soon as it exceeds one screenful :) And this XML is, well, "misused"; it should be more like this:

    <log>
      <record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!">
        <exception id="123" message="java.lang.Exception" class="logtest" method="main" line="30" />
      </record>
    </log>
I also don't object to squeezing the exception attributes into record with some prefix that would make the names unique, like this:

    <record date="2005-02-21T18:57:39" millis="1109041059800" sequence="1" level="SEVERE" class="java.util.logging.LogManager$RootLogger" method="log" thread="10" message="A very very bad thing has happened!" exc-message="java.lang.Exception" exc-class="logtest" exc-method="main" exc-line="30" />
    
Or, if it's possible to normalize these data, factor out the exception and other attributes and build an index of them so that they can be referenced by an ID:

    <exception id="123" message="java.lang.Exception" class="logtest" method="main" line="30" />
    
    <record ... exception-id="123" />


The S-expressions are misused too — Steve Yegge’s example, not mine. My version would have no use of QUOTE at all.


The XML version has the dtd which can be used to validate it though. I am not aware of something similar for s-expressions.


Frankly, I’d rather some form of compressed binary logs, but sure, either works if it includes a stack trace and I need it at the time! Text logs compress well anyway, they just grow larger if/when you process them later... I’m just happy to see objects as log entries instead of plaintext strings. ;)


The xml, definitely. Without lisp-specific tooling that colours/balances parens etc., writing 5 consecutive closing parens (and not 4 or 6) is a headache. Also I probably couldn't choose it on its own merits - I choose what the non-programmers that edit the file can use (and they sure don't have any other editor than notepad).

Then again: I’d usually only ever choose between formats with support already on the platform/standard library, if it was just some config or small data file. If the data is core to the product then of course it might be reasonable to include a new library or even write a parser. I’m talking java and .net now mainly.


> what should be used instead to encode structured data, or perhaps more importantly what should have been used to encode graph like things such as map/lists/objects in the 2000’s

Lisp?


I think you mean s-expressions rather than Lisp, and if you do, I'm with you. They are really neat and underused. Now, if you want to carry a whole language runtime for the structured data format, you'll probably end up with the "which lisp" kind of problem, where each app ends up using something different and the structured data files are no longer exactly portable (until you devise a portable standard, yada yada yada).


I have to admit, once you've seen S-expressions, XML looks insanely bloated and verbose.

The advantage XML would have in that situation is that, because it's so verbose, if you get a malformed XML, you can eyeball-parse it and often figure out how to hand-edit it to make it valid. If it is valid, you can also see exactly how the schema you were sent differs from the schema you expected. An S-expression, having less redundancy, also is potentially more brittle.


If the malforming is regular (like someone printf'd or typed in bad XML), then the same is true for any human-readable format, since you know the data and what it should look like. If not (like a randomly broken packet), then you're solving the problem at the wrong level. By-hand error correction via verbosity is hardly a selling point of an application-level protocol.


> By-hand error-correcting verbosity is hardly a selling point of the application level protocol.

It's a selling point while you're trying to figure out how to get it working. Once you have it working, it's not - but by then, you have it working, so why change it?


Yes! I meant s-expressions, the name escaped me - thanks.


> Abstract Syntax Notation One (ASN.1) is a standard interface description language for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.

> ASN.1 is a joint standard of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and ISO/IEC, originally defined in 1984...

https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One


I think ASN.1 was designed for a world with (a) waterfall development and (b) static deployments - ie no over the air updates. Under the circumstances, messing up is simply not an option - hence defining the standard so clearly and for so many use cases.

Today, of course, we treat the entirety of our deployed infrastructure as 'merely' a platform to write code. And not only are experimentation and failure OK, they're positively encouraged. Velocity became important.


You can meaningfully decode any DER encoded ASN.1 structure and serialize it back without any knowledge of the schema. Somewhat surprisingly you cannot do that with all instances of XML documents.
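The schema-oblivious property is easy to see: DER is tag-length-value all the way down, so a generic walker recovers the structure. A minimal sketch (short-form lengths only; the blob is SEQUENCE { INTEGER 7, OCTET STRING "abcde" }):

```python
def walk(buf):
    """Generic DER TLV walk: returns [(tag, value-or-children), ...]."""
    out, i = [], 0
    while i < len(buf):
        tag, length = buf[i], buf[i + 1]   # assumes short-form (< 128) lengths
        value = buf[i + 2 : i + 2 + length]
        # bit 6 of the tag marks a constructed type: recurse into its contents
        out.append((tag, walk(value) if tag & 0x20 else value))
        i += 2 + length
    return out

der = bytes.fromhex("300a02010704056162636465")
print(walk(der))  # [(48, [(2, b'\x07'), (4, b'abcde')])]
```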


The first thing wrong is that you have to serialize and deserialize it. Operationally, that's inconvenient, and it shows that they're optimizing for network bandwidth. But these days, squeezing the most out of each bit is, in most cases, not a defensible design decision.

Then, once you deserialize it, it's still a printable version of ASN.1. Sure, it's unambiguous, rigidly defined, and standardized. It's still gouge-your-eyeballs-out horrible to try to do anything with.

Say you get an XML message over the wire with a bit flipped. If you look at it, you have a good chance to be able to figure out what went wrong, edit one character, and you can now process it. If you get an ASN.1 message in the same condition, it's pretty much game over (though there may be special tools that could save you).

Say you get an XML, and you don't know the schema. You look at it, and you can see what's going on. You get an ASN.1 where you don't know the schema, and you can be totally sunk. (If I recall correctly, in ASN.1, you can have schemas that are private, that is, not specified in the standard.)


XML (and JSON) have the advantage of names, which makes it slightly easier when it comes to querying (and indeed building indexes) over lots of data. I'd be amazed if there wasn't tons of work on this for S-expressions, but I can imagine it's slightly clunky.


What do you mean by names here? S-expressions have symbols, which serve the exact same purpose as what I think you might mean: an interned string value which can be cheaply used more than once.


Positions within lists aren't named. Sure you can do:

((foo . (bar . baz)) (foo . (bar . baz)) (foo . (bar . baz)) (foo . (bar . baz)))

And say "index all the foos", but you're mixing the structure and the content, in a way that JSON and XML explicitly separate.


Not saying you can't specify some schema here, but there's nothing native to S-expressions that makes it quite as transparent and simple to specify a path into a data structure.
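To make the "path into the structure" point concrete (toy data, the same shape in both encodings):

```python
# JSON/XML-style: names travel with the data, so a path is self-describing.
record = {"exception": {"frame": {"line": 30}}}
print(record["exception"]["frame"]["line"])  # 30

# S-expression-style: names are just list heads; a path is positional, and
# you need the convention (head = tag, rest = children) to interpret it.
sexp = ["record", ["exception", ["frame", ["line", 30]]]]
print(sexp[1][1][1][1])  # 30
```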


All you need on .NET/Java in 2007 then is a lisp parser/interpreter. I know they are easy to write, but they aren't as easy to write as something you don't have to write at all.

Let's not forget, many of these configuration files and data formats were one-off hacks that were meant to be replaced by a real format, a real parser, a real DSL, etc. The reason the xml config/dsl/format stuck is because it worked. And it was cheap and easy.


S-expressions have been around since the 1960s. McCarthy was already proposing S-expressions for what XML would become, in 1975. Stop whining and finding excuses, and start using s-expressions.

http://www-formal.stanford.edu/jmc/cbcl2/cbcl2.html


I'm not trying to pick the best format for the job, I'm usually trying to pick the least bad one that's available IN the platform/standard library I'm using.


That this answer is voted down, on HN of all places, is kinda sad.

It may not be the popular answer, but it’s definitely a valid answer.


It was because I said Lisp instead of s-expressions - I find imprecise language is always a good way to garner down-votes on HN



