Learning from the Web


I've heard Adam Bosworth talk about the lessons we should have learned from the Web about semi-structured data, but it didn't really hit home for me until I read his arguments in this article in Queue. I read it in hard copy six weeks ago or so, and it's finally online so I can link to it. He lists five "unintuitive" lessons:

  1. Simple, relaxed, sloppily extensible text formats and protocols often work better than complex and efficient binary ones.
  2. It is worth making things simple enough that one can harness Moore's law in parallel. This means that a system can scale more or less linearly both to handle the volume of the requests coming in and, ideally, to handle the complexity.
  3. It is acceptable to be stale much of the time.
  4. The wisdom of crowds works amazingly well.
  5. People understand a graph composed of tree-like documents (HTML) related by links (URLs).

He also lists some rules that are more obvious, but bear mentioning:

  1. Pay attention to physics.
  2. Be as loosely coupled as possible.
  3. KISS. Keep it (the design) simple and stupid.

The rules are good and interesting, but what really got my attention was Adam's application of the lessons to XML and Web services:

Lesson 3 tells us that elements in XML with values that are unlikely to change for some known period of time (or where it is acceptable that they are stale for that period of time, such as the title of a book) should be marked to say this. XML has no such model. You can use the HTTP protocol to say this in a very coarse way, but, for example, a book may mostly be invariant, though occasionally have a price change or a new comment, and certainly an item up for auction may see the price change.

Lesson 4 says that we shouldn't over-invest in making schemas universally understood. Certain ones will win when they hit critical mass (as RSS 2.0 has done), and others will not. Some will support several variant but successful strains (as RSS has also done where most readers can read RSS .92, RSS 1.0, RSS 2.0, and Atom). It also suggests that trying to mandate universal URLs for semantics is unlikely to work because, in general, they will not reach a critical mass, and thus people will not know about them--never mind take the trouble to use them. In short, if a large number of people aren't using something or there isn't a feedback mechanism to know what to use, then it will not be used.

Lessons 1 and 5 tell us that XML should be easy to understand without schemas and should let the clients intelligently decide how and when to fetch related information, especially large files such as photos, videos, and music, but even just related sets of XML such as comments on a post, reviews of a candidate, or ratings for a restaurant.
From ACM Queue - Learning from the Web
Referenced Mon Nov 07 2005 14:25:28 GMT-0700 (MST)
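Lesson 3's point about coarseness can be sketched concretely. HTTP lets a server mark an entire response as fresh for some period, but there is no standard way to say that one element inside the document (the price, say) goes stale faster than the rest. A minimal illustration, with made-up values:

```python
# Sketch of the "coarse" staleness model HTTP offers: Cache-Control
# applies to the whole response, not to individual elements within it.
# The max-age values here are illustrative.
from email.utils import formatdate

def response_headers(max_age_seconds):
    """Headers declaring the entire document fresh for max_age_seconds.
    The granularity is the whole resource -- there is no per-element dial."""
    return {
        "Cache-Control": f"max-age={max_age_seconds}",
        "Date": formatdate(usegmt=True),
    }

# A book's title might safely be cached for a day...
headers = response_headers(86400)
# ...but the price embedded in the same document may change hourly.
# HTTP has no way to express that distinction, which is Bosworth's point:
# XML itself would need a per-element staleness model, and it has none.
```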

XML, of course, has limitations with respect to the lessons we might learn from the Web. Adam goes on to talk about RSS and Atom as "a hopeful turn."

Recently, an opportunity has arisen to transcend these limitations. RSS 2.0 has become an extremely popular format on the Web. RSS 2.0 and Atom (which is essentially isomorphic) both support a base schema that provides a model for sets. Atom's general model is a container (a <feed>) of <entry> elements in which each <entry> may contain any namespace scoped elements it chooses (thus any XML), must contain a small number of required elements (<id>, <updated>, and <title>), and may contain some other well-known ones in the Atom namespace such as <link>s. Even better, Atom clearly says that the order doesn't matter. This immediately gives a simple model for sets missing in XML. All one has to do is create a <feed> for a set and put each item in the set into an <entry>. Since all <entry> elements contain an <id> (which is a GUID) and an <updated> element (which is the date it was last altered), it is easy to define the semantics of replacing specific entries and even confirm that you are replacing the ones that you think (e.g., they have the same <id> and the same <updated>). Since they have a title, it is easy to build a user interface to display menus of them. Virtually all Atom entries have a <content> or <summary> element (or both) that summarizes the value of the entry in a user-friendly way and enhances the automatic user interface. If the entry has related binary information such as pictures or sound, it contains <link> elements that have attributes describing the MIME type and size of the target, thus letting the consumer of an Atom feed make intelligent choices about which media to fetch when and how, and resolving the outage XML has in this regard.
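The structure Adam describes is easy to see in code. Here's a minimal sketch that builds a one-entry feed with the required <id>, <updated>, and <title> elements plus a <link> carrying type and length attributes; the id, dates, and URL are made up for illustration:

```python
# Build a minimal Atom feed: a <feed> containing one <entry> with the
# three required elements and a media <link>. Values are illustrative.
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)

def atom(tag):
    """Qualify a tag name with the Atom namespace."""
    return f"{{{ATOM_NS}}}{tag}"

feed = ET.Element(atom("feed"))
entry = ET.SubElement(feed, atom("entry"))

# The required trio: a globally unique id, a last-altered timestamp,
# and a human-readable title (enough to drive a menu-style UI).
ET.SubElement(entry, atom("id")).text = "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"
ET.SubElement(entry, atom("updated")).text = "2005-11-07T14:25:28Z"
ET.SubElement(entry, atom("title")).text = "Learning from the Web"

# A <link> with type and length lets the client decide whether and when
# to fetch the related media -- the intelligent-fetching point in lesson 5.
ET.SubElement(entry, atom("link"), {
    "rel": "enclosure",
    "type": "image/jpeg",
    "length": "123456",
    "href": "http://example.com/photo.jpg",  # hypothetical URL
})

doc = ET.tostring(feed, encoding="unicode")
```

The <id>/<updated> pair is what makes set replacement well-defined: two entries with the same id and the same updated timestamp can be treated as the same item in the same state.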

Adam's conclusion about today's databases is the most interesting (and the part that I hadn't quite grokked before). His summary is that today's databases violate almost every lesson we've learned from the web:

  • They don't easily support simple, relaxed text formats and protocols.
  • They don't enable people to harness Moore's law in parallel.
  • They don't optimize caching when it is OK to be stale.
  • They don't let schemas evolve for a set of items using a bottom-up consensus/tipping point.
  • They don't handle flexible graphs (or trees) well.
  • They haven't made their queries simple and flexible.

I've been working with a company called Aradyme that has a database product with a dynamic schema that does well on some of Adam's points. At present, "it's difficult for a tools-vendor to raise money," so their marketing and sales have been almost entirely about data cleansing (and they're doing pretty well at it). But at some point, I think the underlying tool deserves more exposure. Consequently, I'm intensely interested in the lessons Adam thinks databases ought to learn.