OSCON 2004: Ben Galbraith on Publishing a Medical Textbook with Apache FOP and XSL-FO


Here's the challenge:

  • Entirely new kind of textbook.
    • Structured content, not prose
    • Extensive pictures
    • Books generated on demand
  • Reuse content in other forms

The first attempt took the Microsoft tool approach: Word with special templates as the authoring tool, Word VBA to convert Word documents to an HTML-like format, Access to store the content, and VB and FrameMaker macros to generate the output. A whole generation of books was developed with this technology, but it was a mess and the content was not reusable.

The second attempt used a Java Swing-based editing tool (a modified JTree with Word-like editing features and an XML binding layer). XSLT converted the XML into LaTeX, and MiKTeX then rendered the LaTeX into PDF. The authors liked the authoring tool, but the rendering system was too inflexible: LaTeX was showing its age and the layout was unstable. The data binding was also too inflexible, since it had to be recompiled with every change.

The third attempt was to replace the LaTeX stylesheet with XSL-FO. This was more stable and flexible than the LaTeX solution.
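To make the XSL-FO step concrete, here is a minimal sketch of an XSLT stylesheet that turns a small piece of structured XML into an XSL-FO document, using only the XSLT processor bundled with the JDK. The `topic`, `title`, and `finding` element names and the `TopicToFo` class are invented for illustration; the talk did not show the book's actual schema or stylesheets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TopicToFo {
    // Hypothetical stylesheet: maps an invented <topic> element to a single-page-master XSL-FO document.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + "    xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "  <xsl:template match='/topic'>"
      + "    <fo:root>"
      + "      <fo:layout-master-set>"
      + "        <fo:simple-page-master master-name='page'"
      + "            page-height='11in' page-width='8.5in' margin='1in'>"
      + "          <fo:region-body/>"
      + "        </fo:simple-page-master>"
      + "      </fo:layout-master-set>"
      + "      <fo:page-sequence master-reference='page'>"
      + "        <fo:flow flow-name='xsl-region-body'>"
      + "          <fo:block font-size='14pt' font-weight='bold'>"
      + "            <xsl:value-of select='title'/>"
      + "          </fo:block>"
      + "          <xsl:for-each select='finding'>"
      + "            <fo:block><xsl:value-of select='.'/></fo:block>"
      + "          </xsl:for-each>"
      + "        </fo:flow>"
      + "      </fo:page-sequence>"
      + "    </fo:root>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the stylesheet to the given XML and return the resulting XSL-FO as a string.
    static String toFo(String topicXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(topicXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        // Structured content (fields, not prose), as a form-based editor might save it.
        String xml = "<topic><title>Pneumonia</title>"
                   + "<finding>Fever</finding><finding>Cough</finding></topic>";
        System.out.println(toFo(xml));
    }
}
```

The resulting FO document is what a formatter such as Apache FOP or RenderX XEP would then render to PDF; that rendering step is left out here because it needs a library outside the JDK.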

The editing tool presents structured data as forms to be filled out rather than relying on authors (doctors) to write prose. Creating the editor was a big job, but worth the effort.

XSL-FO 1.0 has some limitations (no support for multiple column regions per page, etc.), but XSL-FO 1.1 will eclipse TeX/LaTeX functionality. Apache FOP is an open source XSL-FO processor. It's not fully compliant and may never be. The project was recently rebooted, and the code is being rewritten from scratch. RenderX XEP is a commercial XSL-FO processor.

One additional problem they ran into was the amount of computation that rendering a full medical text with all its images requires. Ben used JNGI to create a grid of computers to do the rendering a few pages at a time.
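The idea of rendering a few pages at a time can be sketched as follows. This is not JNGI code: a local thread pool from `java.util.concurrent` stands in for the grid, and `renderBatch` is a hypothetical placeholder for whatever actually renders a page range. Only the batching and fan-out pattern is illustrated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchRenderSketch {
    // Split pages 1..totalPages into consecutive {first, last} ranges of at most batchSize pages.
    static List<int[]> batches(int totalPages, int batchSize) {
        List<int[]> ranges = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += batchSize) {
            int last = Math.min(first + batchSize - 1, totalPages);
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }

    // Hypothetical placeholder: a real worker would render these pages and return the output.
    static String renderBatch(int first, int last) {
        return "pages " + first + "-" + last;
    }

    // Submit every batch to the pool, then collect the results in page order.
    static List<String> renderAll(int totalPages, int batchSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int[] range : batches(totalPages, batchSize)) {
                pending.add(pool.submit(() -> renderBatch(range[0], range[1])));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // blocks until that batch is finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("batch rendering failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        for (String result : renderAll(10, 4, 3)) {
            System.out.println(result);
        }
    }
}
```

Collecting the futures in submission order keeps the batches in page order even when workers finish out of order, which matters if the pieces are later stitched back into one book.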