Mike Kruckenberg is from Tufts University. He's talking about how they built a system for managing documents and displaying them for various media (i.e. content management). Mike, in case you're curious, is the brother of Pete, a good friend.
Mike is specifically concerned with translating documents for web and print (namely PDF). They created a document standard with a Schema and developed templates for an XML authoring application to make creating the documents easy. They built a customized XML authoring environment from an off-the-shelf tool, and it was essentially the destination for any conversion process. They also provided an online tool for people who didn't have access to the authoring tool.
Existing HTML documents were cleaned up with Tidy, and then a homegrown tool translated the cleaned-up HTML to XML. Once the XML was valid, the document was put into the database. For MS Word documents, they tried a bunch of things: wvWare, saving as HTML, saving as RTF, and third-party standalone tools. They're looking forward to WordML. PowerPoint is a big tool for faculty, so it had to be easy to convert to an XML document. For PowerPoint, they have a service that creates an XML document from the text, saves JPEGs out, and wraps everything up in XML.
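To make the HTML-to-XML step concrete, here's a minimal sketch of what such a translation could look like. This is not Mike's tool, which wasn't shown. It assumes Tidy was asked for XHTML output so the input is well formed, and the target element names (document, section, heading, para) are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_document(xhtml: str) -> ET.Element:
    """Map tidied XHTML onto a hypothetical <document> vocabulary."""
    root = ET.fromstring(xhtml)  # raises ParseError if not well formed
    doc = ET.Element("document")
    section = None
    for el in root.iter():
        tag = el.tag.replace(XHTML_NS, "")
        # Flatten inline markup such as <b> into plain text.
        text = "".join(el.itertext()).strip()
        if tag == "title":
            ET.SubElement(doc, "title").text = text
        elif tag in ("h1", "h2", "h3"):
            # Each heading opens a new section.
            section = ET.SubElement(doc, "section")
            ET.SubElement(section, "heading").text = text
        elif tag == "p":
            parent = section if section is not None else doc
            ET.SubElement(parent, "para").text = text
    return doc


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Syllabus</title></head>
<body><h1>Week 1</h1><p>Read <b>chapter</b> one.</p></body>
</html>"""

converted = ET.tostring(xhtml_to_document(sample), encoding="unicode")
print(converted)
```

Note how the sketch only has headings and paragraphs to work with. If the source HTML carries no structure beyond that, neither will the XML.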
Here are some questions about conversion I'd ask if there were time:
- Did you try reading Office documents into OpenOffice and then transforming the resulting XML?
- Did you try saving as PDF and then converting that to HTML?
- Are you supporting emerging standards like SlideML?
The transform is done using the libxml2 and libxslt libraries from GNOME because they have good performance and command-line and Perl interfaces. xmllint validates XML against a DTD. xsltproc renders XML as HTML.
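For reference, here's roughly how those two command-line tools could be driven from a script. This is a sketch, not the Tufts pipeline: the file names are placeholders, and I'm using Python's subprocess rather than the Perl interface Mike mentioned. The flags themselves (--noout and --dtdvalid for xmllint, -o for xsltproc) are standard.

```python
import shutil
import subprocess


def validate_cmd(xml_path, dtd_path):
    # --noout suppresses the echoed document; --dtdvalid validates against the DTD.
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]


def render_html_cmd(xml_path, xsl_path, out_path):
    # xsltproc takes the stylesheet first, then the document; -o names the output file.
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]


def run_pipeline(xml_path, dtd_path, xsl_path, out_path):
    """Validate first, and only transform documents that pass."""
    for tool in ("xmllint", "xsltproc"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(validate_cmd(xml_path, dtd_path), check=True)
    subprocess.run(render_html_cmd(xml_path, xsl_path, out_path), check=True)
```

A call would look like run_pipeline("doc.xml", "doc.dtd", "to-html.xsl", "doc.html"), with all four names being hypothetical.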
Just rendering HTML isn't the goal, however. The goal is to render HTML for the web and PDF for print. Mike and his team used FOP and XSL-FO to create PDF.
Mike gives some lessons that they learned:
- Ensure XML is well formed and valid
- Lack of structure in the source document results in meaningless XML
- Special characters require the use of entity mappings
- Using the tool must be convenient
- FO transformations have limitations--read the documentation
- Fonts in PDF can be problematic and require embedding fonts
- Image and spacing issues cause problems and users don't understand the limitations
- The processes can be slow and CPU-intensive, so PDF documents need to be pregenerated, not done in real time.
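The last lesson is the easiest to sketch in code. One way to pregenerate is to rebuild a PDF only when its XML source is newer than the existing output. The helper names below are mine, not from the talk, and the talk didn't say how Tufts schedules its builds. The fop flags (-xml, -xsl, -pdf) are FOP's standard command-line options.

```python
import os
import subprocess


def fop_cmd(xml_path, xsl_path, pdf_path):
    # FOP applies the stylesheet to the XML and writes the PDF in one step.
    return ["fop", "-xml", xml_path, "-xsl", xsl_path, "-pdf", pdf_path]


def is_stale(src, out):
    """True if the output is missing or older than its source."""
    return not os.path.exists(out) or os.path.getmtime(out) < os.path.getmtime(src)


def pregenerate(sources, out_dir, render):
    """Call render(src, out) for each stale PDF; return the paths rebuilt."""
    rebuilt = []
    for src in sources:
        base = os.path.splitext(os.path.basename(src))[0]
        out = os.path.join(out_dir, base + ".pdf")
        if is_stale(src, out):
            render(src, out)
            rebuilt.append(out)
    return rebuilt


def render_with_fop(xsl_path):
    """Build a render callable that shells out to fop (must be on PATH)."""
    def render(src, out):
        subprocess.run(fop_cmd(src, xsl_path, out), check=True)
    return render
```

Run from a scheduled job as something like pregenerate(paths, "pdf", render_with_fop("to-fo.xsl")), this keeps the slow FO step out of the request path, so the web server only ever hands out files that already exist.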
CIO Magazine published an article about this project.