The Essence of XML


Phil Wadler is one of computer science's deepest thinkers in the area of programming language theory. I've been a longtime fan of his work. He presented a paper at this year's POPL (Principles of Programming Languages) entitled The Essence of XML. Phil says some controversial things, among them:

The essence of XML is this: the problem it solves is not hard, and it does not solve the problem well.

Don't let that stop you from reading the article. Phil and his team have developed a formalization of XML Schema which is quite elegant. This formal semantics is part of the official XQuery and XPath specification, one of the first uses of formal methods by a standards body.

Wadler bases his judgment that XML is not particularly good at what it does on its shortcomings with respect to two properties that a data representation language ought to provide:

  • Self-describing: From the external representation one should be able to derive the corresponding internal representation.
  • Round-tripping: If one converts from an internal representation to the external representation and back again, the new internal representation should equal the old.

With respect to these properties, Phil says:

XML has neither property. It is not always self-describing, since the internal format corresponding to an external XML description depends crucially on the XML Schema that is used for validation (for instance, to tell whether data is an integer or a string). And it is not always round-tripping, since some pathological Schemas lack this property (for instance, if there is a type union of integers and strings).
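
To make the two failures concrete, here is a minimal sketch in Python. It is not taken from the paper: the toy schema names ("integer", "string", "integer|string") and the functions erase and validate are hypothetical stand-ins for real XML Schema validation, and the union is assumed to try its alternatives in order.

```python
from typing import Union

# Internal values in this toy model are either integers or strings.
Internal = Union[int, str]


def erase(value: Internal) -> str:
    """Internal -> external: drop the type and keep only the text."""
    return str(value)


def validate(text: str, schema: str) -> Internal:
    """External -> internal, guided by a toy schema name (hypothetical)."""
    if schema == "integer":
        return int(text)
    if schema == "string":
        return text
    if schema == "integer|string":
        # Union type: try the first alternative, fall back to the second.
        try:
            return int(text)
        except ValueError:
            return text
    raise ValueError(f"unknown schema: {schema}")


# Not self-describing: the same external text yields different
# internal values depending on the schema used for validation.
as_int = validate("42", "integer")   # the integer 42
as_str = validate("42", "string")    # the string "42"

# Not round-tripping: under the union schema, the string "42"
# erases to the text 42 and comes back as an integer.
original = "42"
round_tripped = validate(erase(original), "integer|string")
```

Here round_tripped is the integer 42, which is not equal to the original string "42", so the union schema loses the distinction that the internal representation had.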

Of course, as Phil points out, LISP s-expressions have both of these properties. This doesn't necessarily imply that s-expressions would be a good substitute for XML. One of XML's great features is that its parsers work like interpreters rather than compiled programs. That is, they update their syntax on the fly as they work rather than having a syntax compiled in, as is the case with s-expressions or other representations.

The paper introduces, very early, a theorem characterizing validation in terms of erasure and matching. The theorem is easy to understand: it shows how validation takes an external value to an internal value, and how erasure takes an internal value back to an external value. With this model, a lot of thorny issues regarding XML Schema are more readily described and defined. As the introduction to the paper says:

XML Schema is a large and complex standard - more than 300 pages in printed form. The main difficulty was to understand that the essence of XML Schema lies in the way type names and structures interact. Our first surprise has been to realize that once we captured named typing and validation, most of the myriad other features of XML Schema fit neatly into the simple framework presented here. ... Our second surprise has been to realize that, despite XML Schema's complexity, the resulting theorem turns to be simple.
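
The relationship between validation, erasure and matching mentioned above can be sketched in a few lines of Python. This is one reading of the theorem, not a quote from the paper: it assumes the statement has the shape "validating an external value against a type yields an internal value exactly when that internal value matches the type and erases to the external value". The function names, the two toy types, and the restriction to canonical integer text are all assumptions made to keep the sketch small.

```python
def matches(value, type_name: str) -> bool:
    """Does an internal value belong to the toy type?"""
    if type_name == "integer":
        return type(value) is int
    if type_name == "string":
        return type(value) is str
    raise ValueError(f"unknown type: {type_name}")


def erase(value) -> str:
    """Internal -> external: forget the type, keep the text."""
    return str(value)


def validate(text: str, type_name: str):
    """External -> internal, or None if the text does not validate."""
    if type_name == "string":
        return text
    if type_name == "integer":
        try:
            n = int(text)
        except ValueError:
            return None
        # Simplifying assumption: accept only canonical text such as
        # 42, not 042, so that erasure can be a plain function.
        return n if str(n) == text else None
    raise ValueError(f"unknown type: {type_name}")


def theorem_holds(text: str, type_name: str, value) -> bool:
    """Check both sides of the assumed equivalence for one triple."""
    result = validate(text, type_name)
    validates_to = type(result) is type(value) and result == value
    matches_and_erases = matches(value, type_name) and erase(value) == text
    return validates_to == matches_and_erases


# A few triples (external text, type, candidate internal value).
cases = [
    ("42", "integer", 42),     # both sides true
    ("42", "string", "42"),    # both sides true
    ("42", "integer", "42"),   # both sides false: wrong internal type
    ("abc", "integer", 7),     # both sides false: text does not validate
]
all_hold = all(theorem_holds(t, ty, v) for t, ty, v in cases)
```

In this toy model the two sides agree on every case listed, which is the sense in which validation is pinned down by erasure and matching. Note that the round-tripping problem shows up only once a type such as a union lets two different internal values erase to the same text.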

If you've been following the discussion on RDF and grounding, then this is a paper I recommend you read. The mathematics is quite accessible and you'll have a deeper understanding of XML Schemas when you're done.