
So, what is XML good for? If it's not good for data, as everyone says (and I'm not inclined to argue), but it is good for documents, what kind of documents are we referring to? Defined metadata on a text document? A template used with data to generate something else? Is a configuration file a document or data? Where would I want to use XML where something like JSON, a text document, or some combination thereof wouldn't be better?

I'm not being facetious, this is an honest question. Where are the "right" places to use XML?



> Where are the "right" places to use XML?

Tim Bray, co-editor of the XML spec, writing in 2006 on the topic:

> Use JSON: Seems easy to me; if you want to serialize a data structure that’s not too text-heavy and all you want is for the receiver to get the same data structure with minimal effort, and you trust the other end to get the i18n right, JSON is hunky-dory.

> Use XML: If you want to provide general-purpose data that the receiver might want to do unforeseen weird and crazy things with, or if you want to be really paranoid and picky about i18n, or if what you’re sending is more like a document than a struct, or if the order of the data matters, or if the data is potentially long-lived (as in, more than seconds) XML is the way to go.

* https://www.tbray.org/ongoing/When/200x/2006/12/21/JSON

He was also editor of the JSON RFCs:

* https://tools.ietf.org/html/rfc7159

* https://tools.ietf.org/html/rfc8259


I wrote this four years ago: https://news.ycombinator.com/item?id=11446984

There are points of disagreement between me and the author, although I wouldn't get too passionate about them.

Super-short version, reading over it again, is that XML is very good at what it does, but it really ought to be seen as a relatively specialized data format. It's really good at certain tasks, best-of-breed for a couple of them, and degrades rapidly as you get away from that. JSON is a fairly cheap & fast general-purpose format that's OK at a lot of things, isn't necessarily great at much, but as you get into more specialized use cases, also tends to degrade. Being a general-purpose format, perhaps arguably it degrades more "slowly", but it does degrade.

Properly understood, IMHO, their use cases don't overlap much if at all. The combination of them may cover a lot of space, but they are still far, far from the only serialization formats you'll ever need.


Just the sort of thing you would think of as 'documents'--the texts of books, manuscripts, and the like, where structure may be somewhat arbitrary. For instance, I work with a few different text corpora--one of which is an actual dictionary, with entries, definitions, usage examples, etymological information, and bibliographic references. Another is a collection of poetry manuscripts, with annotations for line breaks and editorial emendations, both from the author and from other editors (i.e., places in the manuscript with crossouts, interlinear notes, marginal notes, etc.).

I mean, in theory, you could do this in JSON or some other data structure. But you would go insane and shoot yourself in the head before long.


> you could do this in JSON or some other data structure

I'm not sure you could. For example, in another comment, I mentioned DocBook[1]. How would you do the following sample document in JSON?

  <?xml version="1.0" encoding="UTF-8"?>
  <book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
    <title>Very simple book</title>
    <chapter xml:id="chapter_1">
      <title>Chapter 1</title>
      <para>Hello world!</para>
      <img src="hello.jpg"/>
      <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
    </chapter>
    <chapter xml:id="chapter_2">
      <title>Chapter 2</title>
      <para>Hello again, world!</para>
    </chapter>
  </book>
Would you make each <chapter> into an object? But you have two <para> children in there, with an <img> in between, and one <para> has an additional <emphasis> in its content. I can't think of a good JSON schema equivalent to this.

[1] https://en.wikipedia.org/wiki/DocBook#Sample_document


you could always do an S-expression-esque DSL in JSON ;)

  ['book', {'id': '...'},
    ['title', {}, ...],
    ['chapter', {'id': 0},
      ['title', {}, 'Chapter 1'],
      ...
    ],
    ['chapter', {'id': 1},
      ...
    ],
  ]
more realistically, you could just represent it with the AST of that XML, i.e

  {
    'type': 'book',
    'attrs': {'id': ...},
    'children': [
      {
        'type': 'title',
        'children': ['Simple book']
      },
      {
        'type': 'chapter',
        ...
      },
      {
        'type': 'chapter',
        ...
      },
      ...
    ]
  }

so you could do that emphasis bit as

  [
    'this text needs more ',
    {'type': 'emphasis', 
     'children': ['emotion']},
    '!'
  ]
hellish to write by hand but probably okay for a program to consume (modulo all the XML libs/tooling you can't use). and you could probably even write some kind of schema for it.

if i actually had to represent that data, i'd also move some child nodes into attributes, e.g. make all nodes with 'type': 'book' also have a 'title' attribute, like you would if you had an AST datatype
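to make the "okay for a program to consume" point concrete, here's a toy serializer I wrote for this comment (the "type"/"attrs"/"children" keys mirror the AST shape above; none of this is a real library):

```python
# Hypothetical sketch: turn the AST-style JSON above back into XML.
# Keys ("type", "attrs", "children") mirror the structure in this comment.
from xml.sax.saxutils import escape, quoteattr

def to_xml(node):
    if isinstance(node, str):  # bare strings are text nodes
        return escape(node)
    attrs = "".join(f" {k}={quoteattr(str(v))}"
                    for k, v in node.get("attrs", {}).items())
    body = "".join(to_xml(child) for child in node.get("children", []))
    return f"<{node['type']}{attrs}>{body}</{node['type']}>"

doc = {
    "type": "para",
    "children": [
        "this text needs more ",
        {"type": "emphasis", "children": ["emotion"]},
        "!",
    ],
}
print(to_xml(doc))
# → <para>this text needs more <emphasis>emotion</emphasis>!</para>
```

a dozen lines to get back to XML, which is sort of the point: the JSON is just a less convenient spelling of the same tree.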


See https://developers.google.com/docs/api/samples/output-json for what Google Docs does - basically separating markup from the text by using indices.

which is probably the only way to properly deal with markup, and especially with commented sections that can span paragraph boundaries. Neither JSON nor XML seems to have a proper answer for such annotations, and I wonder if there's any standard format that can handle them, especially if humans still want to be able to reasonably view or edit it...

(OOXML and its binary equivalents more or less solve this by completely separating paragraph and character formatting, both separately indexing the spans of text they annotate)
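to illustrate the index-based idea (this is my own toy sketch in the spirit of the Google Docs format, not its actual schema): the text stays plain, and formatting lives in separate (start, end) runs.

```python
# Hypothetical sketch of index-based markup: the text is a plain string,
# and formatting is kept in separate (start, end) spans over it.
def render(text, runs):
    out, pos = [], 0
    for run in sorted(runs, key=lambda r: r["start"]):
        out.append(text[pos:run["start"]])
        # this toy version only knows about bold runs
        out.append("<b>" + text[run["start"]:run["end"]] + "</b>")
        pos = run["end"]
    out.append(text[pos:])
    return "".join(out)

text = "Hello again, world!"
runs = [{"start": 6, "end": 11}]  # annotates the word "again"
print(render(text, runs))  # → Hello <b>again</b>, world!
```

since runs are just index pairs, two annotations can overlap or cross paragraph boundaries without the nesting fights you get in a tree-shaped format.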


That is what essentially every WYSIWYG word processor does. And it is also the reason why getting sane HTML out of a word processor is somewhat non-trivial: the separately indexed spans can very well overlap, contradict each other, or contain completely unnecessary formatting information.


Potential option:

  {
    "id": "simple_book",
    "title": "Very simple book",
    "chapters": [
      {
        "id": "chapter_1",
        "content": [
          { "type": "title", "value": "Chapter 1" },
          {
            "type": "para",
            "content": [
              { "type": "text", "value": "Hello World!" }
            ]
          },
          { "type": "img", "src": "hello.jpg" },
          {
            "type": "para",
            "content": [
              { "type": "text", "value": "I hope that your day is proceeding " },
              { "type": "emphasis", "value": "splendidly" },
              { "type": "text", "value": "!" }
            ]
          }
        ]
      }
    ]
  }


But as pointed out in the article, JSON objects aren't guaranteed to preserve the order of your nested bits; your code is going to have to worry about that. And it will quickly become unmanageably complex. When you are, for instance, creating a marked-up transcript of some archival material, there's a lot of human editing involved. Have a look at the TEI documentation to see how messy it can get.


Certainly. I wasn't suggesting that JSON representation I put up there was actually a good idea, just that it's theoretically possible to represent that document as JSON.


I absolutely agree. Where XML shines is when you could take just the text content - strip out all the markup elements, attributes, comments, etc - and still have a text document that makes some sort of sense.

This, AFAICT, was actually why SVG made a few bizarre choices, such as putting all the drawing commands into attributes: a browser that didn't understand an SVG document embedded in its HTML would be left with just the text content.
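a minimal illustration of what I mean (my own example, not from any spec):

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="120" height="40">
  <!-- the drawing commands live in the "d" attribute, not in element content -->
  <path d="M10 10 L110 10" stroke="black"/>
  <text x="10" y="30">A renderer that ignores unknown elements still shows this.</text>
</svg>
```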


A right place to use XML is when you want to write structured & reusable content, for example when writing documentation. See DITA [1].

[1] https://en.wikipedia.org/wiki/Darwin_Information_Typing_Arch...


I've been getting along fine using JSON for pretty much everything. That being said, XML has some very sophisticated features: rigorous schema definition, a query language, a formal include syntax, comments (that's a big one), much easier multi-line content, and in fact the ability to mix normal text and structured data.

The include syntax doesn't get enough love. It's crazy that JSON doesn't support it.


The issue with all the XML sophistication is that essentially the only environment where all of that really works is when you use XML as a markup language for technical publishing (i.e. DocBook, DITA, ...), in effect as a less convenient but more modern and cool SGML replacement.

Random applications that read XML just aren't going to implement fully validating parsers, because that is a lot of completely unnecessary work. Also, the pattern mentioned in the article of storing everything in attributes mostly comes from the fact that working with CDATA nodes in XML is a major PITA wrt. whitespace handling and coalescing adjacent nodes.


Think of semi-structured documents, where you have a list of pre-defined sections. I've seen it in use by insurance companies for case reports, and in real estate for appraisals. And of course we've all seen it work well in the form of HTML. There's some structure to all of these examples, but mostly for annotating text sections. You need some flexibility built into the schema to add fields as needed, but you're not dealing with various map/list/primitive data types as a matter of routine. Just making this one up, but if LaTeX weren't already the standard, I'd also use it for digitizing the content of academic papers, for instance. You have a header with metadata, an abstract, the body, citations. There's some structure, and a need to add some metadata, perhaps flexibly over time, but mostly it's just a document.


DocBook[1] is probably another good use of XML.

[1] https://en.wikipedia.org/wiki/DocBook#Sample_document


What is the exact relationship? The history as I remember it is:

1. SGML (Simple Generalised Markup Language) came first.

2. HTML was a specialisation of SGML; it took off because of the web, and is probably the only reason for SGML to become famous.

3. XML was then invented as a generalisation of HTML, perhaps by people who had never heard of SGML.

And I seem to remember DocBook is an SGML thing, it was invented between steps 2 and 3.


That's completely wrong. XML is specified as a subset of SGML (it says so in the preamble, even) by folks who were also involved in specifying SGML. Moreover, these same folks (the "Extended Review Board" at W3C) also amended SGML to align with the XML profile of SGML in ISO 8879 Annex K, aka the "WebSGML adaptations".

Also: SGML = standard generalized markup language


The sample on the Wiki page is XML, though. If you look at the DocBook spec's intro[1], it says:

> DocBook is general purpose [XML] schema

[1] http://docs.oasis-open.org/docbook/docbook/v5.1/os/docbook-v...


DocBook was originally an SGML document type, and most of the DocBook formatters were written in DSSSL. Large amounts of documentation for open source software (and many O'Reilly books) are still SGML DocBook.


I worked on a project that used XML as document format for config files.


XML for config files is utter madness.

Why use XML when INI files are way easier to read, and especially edit if needs be?

Why add a ton of additional sugar on top of something as simple as "a=b" for config files?


When you have much more than one line of these "a=b", that sugar helps. XML has hierarchy: you can group related values into elements. XML has comments. XML can be typed; the standard even defines formats for numbers, dates, and times. There are good libraries to serialize/deserialize objects to XML in pretty much all OO languages out there. I use them a lot, and I rarely expect users to edit XML; I give them a GUI to change the settings they need, which updates the config.
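a small sketch of what I mean (the element and attribute names here are made up for illustration), using nothing but Python's standard library:

```python
# Minimal sketch of reading a hierarchical XML config; element and
# attribute names ("server", "logging", etc.) are invented for the example.
import xml.etree.ElementTree as ET

config = ET.fromstring("""
<config>
  <!-- comments are legal here, unlike in JSON -->
  <server host="localhost" port="8080"/>
  <logging>
    <level>debug</level>
  </logging>
</config>
""")

server = config.find("server")
print(server.get("host"), int(server.get("port")))  # localhost 8080
print(config.findtext("logging/level"))             # debug
```

hierarchy, comments, and typed values, with no parser you have to write yourself.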


Well, the problem with INI files for configuration is that config files (legitimately!) need to be able to represent repetition, nesting, schemas and comments, which there was never any standardization for. While XML seems like overkill for something as mundane as a config file, the standard does at least cover all of the cases you need.


If your config is that complex, you might be better served by JSON - or the full JS or, say, Lua, for that matter. Because what you are talking about looks more like (interpreted) code than a config.


It's under- and mis-used, but XSD helps validate config files and provides some guidance on the structure. I know JSON Schema is still in draft; is it used much?

An example of XML config we used at my workplace was a processing pipeline with various modules and options/parameters encoded for each phase (some optional) of the pipeline. So in a sense it was configuration that resulted in executing code modules, not so much your standard options.


It was for rather complicated nested data and view definitions


Sounds like a - gasp! - data format.


XML is absolutely excellent for markup. There are no competitors here.

Markup consists of two things: a scalar (usually a string, but it can be a binary sequence) and associated structured data: smaller scalars, records with fields, and lists (there's no "etc." here; that's all).

The structured data is either discovered in the scalar by parsing or added to it by marking it up. Parsing applies to binary data and artificial languages (although there are parsers for natural languages as well); marking up applies to structure that cannot be parsed out but can be added manually, usually during authoring, though also during after-the-fact indexing.

XML stores both the original scalar and the structure together in a single piece. There's extensive tooling for processing the result.

Practical examples:

1. Parse a C file and do something with it other than compiling. E.g. you want to publish it, index it with cross-references, maybe transform it: XML shines here (you'll normally want to add XSLT to it).

2. Author text and do something with it. If it's Markdown, apply minimal parsing and save the resulting AST in XML. Same for reST and any other format out there: just get it into XML as soon as you can and process the XML from that point. Whatever you want to produce (XML, man pages, PDFs), the XML toolchain will help you get there.

3. Mark up existing text. E.g. you have a collection of letters and want to index all references to people. XML would be a very good choice here too. (I'd say that marking up and indexing all the existing texts of humanity would be a very important project. There's already a lot of effort to publish them, and marking up and indexing is what naturally comes next.)

4. I'd venture to say that even binary formats would benefit from conversion to XML and back, because of what's possible with the XML toolchain (I'm thinking mostly about transformation, but indexing would also be good). E.g. read a collection of MP3 files, parse out what they contain (ID3 tags of different versions at the beginning or end, APE tags, other such tags, and MPEG frames), and then do what you want: index by anything, clean up, add extra information that cannot be expressed in tags (classification for classical music or Argentine tango, for example), and so on.
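The "get it into XML as soon as you can" step from point 2 might look like this toy sketch (my own illustration; a real pipeline would use a proper Markdown parser rather than a regex):

```python
# Toy sketch: parse minimal Markdown emphasis into an XML tree, which
# XML tooling (XSLT, XPath, ...) can then transform further.
import re
import xml.etree.ElementTree as ET

def md_line_to_xml(line):
    para = ET.Element("para")
    pos = 0
    for m in re.finditer(r"\*(.+?)\*", line):
        plain = line[pos:m.start()]
        if len(para):                     # text after the last child element
            para[-1].tail = (para[-1].tail or "") + plain
        else:                             # text before any child element
            para.text = (para.text or "") + plain
        em = ET.SubElement(para, "emphasis")
        em.text = m.group(1)
        pos = m.end()
    tail = line[pos:]
    if len(para):
        para[-1].tail = (para[-1].tail or "") + tail
    else:
        para.text = (para.text or "") + tail
    return ET.tostring(para, encoding="unicode")

print(md_line_to_xml("a *splendid* day"))
# → <para>a <emphasis>splendid</emphasis> day</para>
```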

PS: Since XML can store structures alongside a scalar, it can also store structures alone: just drop the scalar. It's a very good format for structured data, absolutely not as bad as it's usually painted. Much better than JSON, actually. But you have to prepare it well.

PPS: Scalars and structured data are, of course, the natural parlance of all other programming languages out there, so everything XML does you can do without XML. But it also means that XML is not as foreign as it appears. There is some friction between getting data out of XML and putting it back, but it's about the same as with SQL.



