From ODT to POD (with practical dynamic language features)

Most of the interesting problems of programming are a combination of applied theory and the messy details of your specific problem.

In "From ODT to POD with Perl", I demonstrated the use of open classes to solve a transliteration problem of walking an existing tree of existing objects. That's the applied theory. When you already have parent and child relationships expressed in an existing object model, representation and traversal are simple. Similarly, when you know specific details of what you want to emit when you visit each object in such a tree, emitting the right things is simple.

The messy details in this problem come from figuring out what to emit. The code I posted has something called "style mappings" which translate the names of styles defined in the header of the ODT to POD styles.

(Remember that this code is a one-off conversion program, so it can be a little bit messy and it can live in a single file. It also doesn't have to be 100% perfect. We can edit out a few nits if it gets a couple of small things wrong, because the goal of the program is to save us many hours of work.)

Remember that the styles in the ODT file look like:

<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
    <style:paragraph-properties fo:background-color="#666699">
        <style:background-image />
    </style:paragraph-properties>
    <style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="14pt" fo:font-weight="bold" style:font-size-asian="14pt" style:font-weight-asian="bold" style:font-size-complex="14pt" style:font-weight-complex="bold" />
</style:style>

This XML is reasonably sane; it's reasonably regular. I went through the XML for a file formatted with our document template and made a list of the formatting characteristics of every unique markup element we want to express in our PseudoPod output. For example, every element we'll express as a header has a specific background color. Every element we'll express with the code tag (C<>) has a font name of "Courier New". Of two tags (style:paragraph-properties and style:text-properties, siblings under a style:style tag), I care about six attributes:

fo:background-color, the background color of the paragraph
fo:color, the color of the font itself
fo:font-weight, any boldness of the font
style:font-name, the name of the font
fo:font-style, any italicizing of the font
fo:font-size, any change in the font's size

The various allowed combinations of those six attributes for any of the defined styles produce the names of the methods to call to emit the appropriate POD for text nodes which have those styles applied.

This is easier to see with code:

sub get_xml_style_methods
{
    my $xpath   = shift;
    my $nodeset = $xpath->find( '//style:style' );
    my %styles  =
    (
        Standard => 'toPodForPlain',
        Empty    => 'toPodForEmpty'
    );

    for my $node ($nodeset->get_nodelist)
    {
        my $style_name = $node->getAttribute( 'style:name' );
        my $paraprops  = $node->find( './style:paragraph-properties' );
        my $bgcolor    = $paraprops
                       ? $paraprops->shift
                                   ->getAttribute('fo:background-color')
                       : '';
        my $textprops  = $node->find( './style:text-properties' );

        my ($color, $weight, $name, $style, $size) = ('') x 5;

        if ($textprops)
        {
            my $text_node = $textprops->shift;
            my $maybe     = sub
            {
                $text_node->getAttribute( shift ) || ''
            };

            $color        = $maybe->( 'fo:color'       );
            $weight       = $maybe->( 'fo:font-weight' );
            $name         = $maybe->( 'style:font-name');
            $style        = $maybe->( 'fo:font-style'  );
            $size         = $maybe->( 'fo:font-size'   );
        }

        my @properties;

        if ($bgcolor)
        {
            @properties     = 'Head0'  if $bgcolor eq '#666699';
            @properties     = 'Head2'  if $bgcolor eq '#9999cc';
        }
        else
        {
            push @properties, 'Code'   if $name    eq 'Courier New';
            push @properties, 'Bold'   if $weight  eq 'bold';
            push @properties, 'Italic' if $style   eq 'italic';
            @properties     = 'Head1'  if $size    eq '14pt';
        }

        @properties          = 'Plain' unless @properties;
        my $type             =  $style_name    =~ /^P/
                             && $properties[0] !~ /^Head/
                             ? 'Para'
                             : '';
        $styles{$style_name} = 'toPodFor'
                             . join( '', sort @properties ) . $type;
    }

    return \%styles;
}

Even though in theory any style could have any combination of those six attributes, in practice our documents are very similar and consistent. Only header styles have a background color, so either of the two known background colors immediately makes a style only and ever a toPodForHead\d style.

(I like the use of the closure $maybe to abstract away the || '' repetition. Learn Scheme!)

I didn't mention that the names of styles have the form P\d+ or T\d+ in the ODT file. These names seem to signify whether the style applies to a block-level element (a paragraph as a whole) or a snippet of inline text (a word in monospace or italicized text, for example). Rather than dealing with those individually, I glom them all together and append the word Para to those styles which apply to paragraphs as a whole. It's a shortcut. It's a little messy.
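To make the naming rules concrete, here's a minimal sketch which applies the same rules to a plain hash of attributes instead of XML nodes. The helper method_name_for and its short attribute keys are hypothetical (they aren't part of the program), and the style names and attribute values passed to it are made-up examples.

```perl
use strict;
use warnings;

# Hypothetical helper: the naming rules of get_xml_style_methods(),
# applied to a plain hash of attributes rather than to XML nodes.
sub method_name_for
{
    my ($style_name, %attr) = @_;

    # default every missing attribute to the empty string
    my ($bgcolor, $name, $weight, $style, $size)
        = map { $attr{$_} // '' }
          qw( background-color font-name font-weight font-style font-size );

    my @properties;

    if ($bgcolor)
    {
        # a known background color wins over everything else
        @properties = 'Head0' if $bgcolor eq '#666699';
        @properties = 'Head2' if $bgcolor eq '#9999cc';
    }
    else
    {
        push @properties, 'Code'   if $name   eq 'Courier New';
        push @properties, 'Bold'   if $weight eq 'bold';
        push @properties, 'Italic' if $style  eq 'italic';
        @properties = 'Head1'      if $size   eq '14pt';
    }

    @properties = 'Plain' unless @properties;

    # only non-header paragraph styles (P\d+) get the Para suffix
    my $type = $style_name =~ /^P/ && $properties[0] !~ /^Head/
             ? 'Para'
             : '';

    return 'toPodFor' . join( '', sort @properties ) . $type;
}

my @examples =
(
    [ P1 => 'background-color' => '#666699', 'font-size' => '14pt' ],
    [ T3 => 'font-name' => 'Courier New', 'font-weight' => 'bold'  ],
    [ P7 => 'font-style' => 'italic'                               ],
    [ T1 =>                                                        ],
);

for my $example (@examples)
{
    my ($style_name, %attr) = @$example;
    print "$style_name => ", method_name_for( $style_name, %attr ), "\n";
}

# P1 => toPodForHead0
# T3 => toPodForBoldCode
# P7 => toPodForItalicPara
# T1 => toPodForPlain
```

Sorting the property names means that a bold, monospaced style always maps to toPodForBoldCode, no matter the order in which the checks ran.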

I did offer a small apologia for this code not being fully abstracted and factored as far as you might like, but note that this code is fully encapsulated into its own function. It takes an XML::XPath object and returns a hash of style names to transliteration method names. It doesn't mess with global state.

That's important because:

#!/usr/bin/env perl

use Modern::Perl;
use autodie;

use Archive::Zip;
use HTML::Entities;
use Regexp::Assemble;

use XML::XPath;
use XML::XPath::XMLParser;

exit main( @ARGV );

sub main
{
    my @files = @_;
    @files    = get_all_files() unless @files;

    for my $file (@files)
    {
        my $xml = get_xml_contents( $file );
        my $pod = rewrite_xml( $xml );
        write_as_pod( $file, $pod );
    }

    return 0;
}

sub get_all_files { <*.odt> }

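sub get_xml_contents
{
    my $file    = shift;

    # an ODT file is a zip archive; the document body lives in content.xml
    my $zip     = Archive::Zip->new( $file )
        or die "Cannot read '$file' as a zip archive\n";
    my $content = $zip->memberNamed( 'content.xml' )
        or die "No content.xml in '$file'\n";
    return $content->contents;
}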

...

I could let this program work on one file at a time, but I'd rather loop within main() over any and all files given as arguments than write a loop in shell. I do want to make that hash of style names to transliteration methods available somehow globally, but only while processing each file individually, so closures again come to the rescue:

{
    my $style_methods;

    sub set_methods_for_styles   { $style_methods = shift    }
    sub clear_methods_for_styles { undef $style_methods      }
    sub get_method_for_style     { $style_methods->{ $_[0] } }
}

sub rewrite_xml
{
    my $contents = shift;
    my $xpath    = XML::XPath->new( xml => $contents );

    set_methods_for_styles( get_xml_style_methods( $xpath ) );
    my $pod           = xml_to_pod( $xpath );
    clear_methods_for_styles();

    return $pod;
}

Hiding that global state behind an accessor with a writer and clearer means that only one place has to reset that global state between working on files. If I were to write this program with more robustness, I'd add an exception handler around the call to xml_to_pod() so that the clearer always gets called (unless something catastrophic happens, like a segfault or my server falling into a micro black hole), but the two types of problems I've had in this code are compilation-killing typos and missing transliteration methods. Both kill the program effectively enough that they need immediate fixing.
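For the curious, here's a minimal sketch of what that exception handler could look like, using only eval and die. The wrapper with_style_methods is hypothetical, and the anonymous subs stand in for the real call to xml_to_pod(); the three accessors are repeated so that the example runs on its own.

```perl
use strict;
use warnings;

{
    my $style_methods;

    sub set_methods_for_styles   { $style_methods = shift    }
    sub clear_methods_for_styles { undef $style_methods      }
    sub get_method_for_style     { $style_methods->{ $_[0] } }
}

# Hypothetical wrapper: install the style methods, run the given code,
# and clear the methods again whether or not that code died.
sub with_style_methods
{
    my ($methods, $code) = @_;

    set_methods_for_styles( $methods );

    my $result;
    my $ok    = eval { $result = $code->(); 1 };
    my $error = $@;

    # runs on success and on failure alike
    clear_methods_for_styles();

    # rethrow only after the cleanup has happened
    die $error || "Unknown error\n" unless $ok;
    return $result;
}

# normal case: the code sees the style methods
my $name = with_style_methods(
    { P1 => 'toPodForHead0' },
    sub { get_method_for_style( 'P1' ) },
);
print "$name\n";    # toPodForHead0

# failure case: the error still propagates, but the state gets cleared
eval
{
    with_style_methods(
        { P1 => 'toPodForHead0' },
        sub { die "missing method\n" },
    );
};
print "error: $@";  # error: missing method
print defined get_method_for_style( 'P1' ) ? "still set\n" : "cleared\n";
# cleared
```

The trailing 1 inside the eval block makes success explicit, so the wrapper doesn't have to trust the contents of $@ to decide whether something went wrong.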

All in all, this program was a lot easier to write than I thought it would be. I'd written a regex-based converter from Google Docs HTML output to PseudoPod before and it was awful. This approach was a lot easier. Much credit goes to ODT for being relatively sane in its XML, and much credit also goes to the ClubCompy documentation for having a regular (if informal) template. Perl deserves a lot of credit for great XML processing modules, open classes, dynamic dispatch, and other higher level programming techniques that let clever people make little messes to make difficult problems tractable.
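If you haven't seen dispatch by method name before, here's a minimal sketch of the idea. My::TextNode is a hypothetical stand-in for a node class (the real program adds its methods to existing XML node classes instead), the style names are made up, and the POD each method emits is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a text node class
package My::TextNode;

sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub text  { $_[0]{text}  }
sub style { $_[0]{style} }

# Transliteration methods; what they emit is an assumption
sub toPodForPlain  { $_[0]->text }
sub toPodForCode   { 'C<' . $_[0]->text . '>' }
sub toPodForItalic { 'I<' . $_[0]->text . '>' }

package main;

# Made-up style names mapped to method names, in the shape of the hash
# which get_xml_style_methods() returns. T9 names a method that
# deliberately doesn't exist.
my %method_for =
(
    T1 => 'toPodForCode',
    T2 => 'toPodForItalic',
    T9 => 'toPodForBoldItalic',
);

sub emit
{
    my $node   = shift;
    my $method = $method_for{ $node->style } // 'toPodForPlain';

    # a string in method position: Perl looks up the method at runtime
    return $node->$method();
}

for my $style (qw( T1 T2 T5 ))
{
    my $node = My::TextNode->new( style => $style, text => 'word' );
    print "$style: ", emit( $node ), "\n";
}

# T1: C<word>
# T2: I<word>
# T5: word

# a style which maps to a missing method dies loudly
eval { emit( My::TextNode->new( style => 'T9', text => 'oops' ) ) };
print "died: $@" if $@;
```

The last call dies with Perl's "Can't locate object method" error, which names the missing transliteration method.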

If you want to solve this problem with less monkeypatching, Robin Smidsrød's XML::Rabbit looks very effective. The essential messiness of the problem doesn't go away, but creating an effective object model from an XML document with XPath will get you halfway to the right solution.
