Most of the interesting problems of programming are a combination of applied theory and the messy details of your specific problem.
In "From ODT to POD with Perl" I demonstrated the use of open classes to solve a transliteration problem: walking an existing tree of existing objects. That's the applied theory. When you already have parent and child relationships expressed in an existing object model, representation and traversal are simple. Similarly, when you know specific details of what you want to emit when you visit each object in such a tree, emitting the right things is simple.
The messy details in this problem come from figuring out what to emit. The code I posted has something called "style mappings" which translate the names of styles defined in the header of the ODT to POD styles.
(Remember that this code is a one-off conversion program, so it can be a little bit messy and it can live in a single file. It also doesn't have to be 100% perfect. We can edit out a few nits if it gets a couple of small things wrong, because the goal of the program is to save us many hours of work.)
Remember that the styles in the ODT file look like:
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
    <style:paragraph-properties fo:background-color="#666699">
        <style:background-image />
    </style:paragraph-properties>
    <style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="14pt" fo:font-weight="bold" style:font-size-asian="14pt" style:font-weight-asian="bold" style:font-size-complex="14pt" style:font-weight-complex="bold" />
</style:style>
This XML is reasonably sane; it's reasonably regular. I went through the XML for a file formatted with our document template and made a list of the formatting characteristics of every unique markup element we want to express in our PseudoPod output. For example, every element we'll express as a header has a specific background color. Every element we'll express with the code tag (C<>) has a font name of "Courier New". Of two tags (style:paragraph-properties and style:text-properties, siblings under a style:style tag), I care about six attributes:
- fo:background-color, the background color of the paragraph
- fo:color, the color of the font itself
- fo:font-weight, any boldness of the font
- style:font-name, the name of the font
- fo:font-style, any italicizing of the font
- fo:font-size, any change in the font's size
The various allowed combinations of those six attributes for any of the defined styles produce the names of the methods to call to emit the appropriate POD for text nodes which have those styles applied.
This is easier to see with code:
sub get_xml_style_methods
{
    my $xpath   = shift;
    my $nodeset = $xpath->find( '//style:style' );
    my %styles  =
    (
        Standard => 'toPodForPlain',
        Empty    => 'toPodForEmpty'
    );

    for my $node ($nodeset->get_nodelist)
    {
        my $style_name = $node->getAttribute( 'style:name' );
        my $paraprops  = $node->find( './style:paragraph-properties' );
        my $bgcolor    = $paraprops
                       ? $paraprops->shift
                                   ->getAttribute('fo:background-color')
                       : '';
        my $textprops  = $node->find( './style:text-properties' );
        my ($color, $weight, $name, $style, $size) = ('') x 5;

        if ($textprops)
        {
            my $text_node = $textprops->shift;
            my $maybe     = sub
            {
                $text_node->getAttribute( shift ) || ''
            };

            $color  = $maybe->( 'fo:color'       );
            $weight = $maybe->( 'fo:font-weight' );
            $name   = $maybe->( 'style:font-name');
            $style  = $maybe->( 'fo:font-style'  );
            $size   = $maybe->( 'fo:font-size'   );
        }

        my @properties;

        if ($bgcolor)
        {
            @properties = 'Head0' if $bgcolor eq '#666699';
            @properties = 'Head2' if $bgcolor eq '#9999cc';
        }
        else
        {
            push @properties, 'Code'   if $name   eq 'Courier New';
            push @properties, 'Bold'   if $weight eq 'bold';
            push @properties, 'Italic' if $style  eq 'italic';
            @properties = 'Head1'      if $size   eq '14pt';
        }

        @properties = 'Plain' unless @properties;

        my $type = $style_name =~ /^P/
                && $properties[0] !~ /^Head/
                 ? 'Para'
                 : '';

        $styles{$style_name} = 'toPodFor'
                             . join( '', sort @properties ) . $type;
    }

    return \%styles;
}
Even though in theory any style could have any combination of those six attributes, in practice our documents are very similar and consistent. Any background color immediately makes a style only and ever a toPodForHead\d style.
(I like the use of the closure $maybe to abstract away the || '' repetition. Learn Scheme!)
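As a minimal sketch of that closure trick outside of any XML library, assume a plain hash named %attributes (hypothetical) standing in for the text-properties node. The anonymous sub captures the hash, so each caller passes only the attribute name and always gets back a defined string:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an XML node: a plain hash of attributes.
my %attributes = ( 'fo:font-weight' => 'bold' );

# The closure captures %attributes, so callers pass only the key.
# A missing (or otherwise false) value comes back as an empty string.
my $maybe = sub { $attributes{ shift() } || '' };

my $weight = $maybe->( 'fo:font-weight' );
my $style  = $maybe->( 'fo:font-style'  );

print "weight=[$weight] style=[$style]\n";   # prints: weight=[bold] style=[]
```

Because every value is a defined string, the later eq comparisons never trip an "uninitialized value" warning.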
I didn't mention that the names of styles have the form P\d+ or T\d+ in the ODT file. These names seem to signify whether the style applies to a block-level element (a paragraph as a whole) or a snippet of inline text (a word in monospace or italicized text, for example). Rather than dealing with those individually, I glom them all together and append the word Para to those styles which apply to paragraphs as a whole. It's a shortcut. It's a little messy.
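To see which method names fall out of those rules, here is a sketch that applies the same naming logic to plain hashes instead of XML nodes. The helper method_name_for() and the style names P7, T3, and T9 are hypothetical; only P1 and its background color come from the sample XML above.

```perl
use strict;
use warnings;

# Hypothetical helper: the naming rule from get_xml_style_methods(),
# applied to a hash of attributes rather than to XML nodes.
sub method_name_for
{
    my ($style_name, %attr) = @_;
    my @properties;

    if ($attr{bgcolor})
    {
        # Any known background color means a heading and nothing else.
        @properties = 'Head0' if $attr{bgcolor} eq '#666699';
        @properties = 'Head2' if $attr{bgcolor} eq '#9999cc';
    }
    else
    {
        push @properties, 'Code'   if ($attr{name}   // '') eq 'Courier New';
        push @properties, 'Bold'   if ($attr{weight} // '') eq 'bold';
        push @properties, 'Italic' if ($attr{style}  // '') eq 'italic';
        @properties = 'Head1'      if ($attr{size}   // '') eq '14pt';
    }

    @properties = 'Plain' unless @properties;

    # P-styles cover whole paragraphs; headings never get the suffix.
    my $type = $style_name =~ /^P/ && $properties[0] !~ /^Head/ ? 'Para' : '';

    return 'toPodFor' . join( '', sort @properties ) . $type;
}

print method_name_for( 'P1', bgcolor => '#666699' ), "\n";                 # toPodForHead0
print method_name_for( 'P7', name => 'Courier New' ), "\n";                # toPodForCodePara
print method_name_for( 'T3', weight => 'bold', style => 'italic' ), "\n";  # toPodForBoldItalic
print method_name_for( 'T9' ), "\n";                                       # toPodForPlain
```

Sorting the property names keeps the method name stable no matter which order the attributes were checked in, so one method handles every bold-and-italic style.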
I did offer a small apologia for this code not being fully abstracted and factored as far as you might like, but note that this code is fully encapsulated into its own function. It takes an XML::XPath object and returns a hash of style names to transliteration method names. It doesn't mess with global state.
That's important because:
#!/usr/bin/env perl

use Modern::Perl;
use autodie;

use Archive::Zip;
use HTML::Entities;
use Regexp::Assemble;
use XML::XPath;
use XML::XPath::XMLParser;

exit main( @ARGV );

sub main
{
    my @files = @_;
    @files    = get_all_files() unless @files;

    for my $file (@files)
    {
        my $xml = get_xml_contents( $file );
        my $pod = rewrite_xml( $xml );
        write_as_pod( $file, $pod );
    }

    return 0;
}

sub get_all_files { <*.odt> }

sub get_xml_contents
{
    my $file    = shift;
    my $zip     = Archive::Zip->new( $file );
    my $content = $zip->memberNamed( 'content.xml' );
    return $content->contents;
}
...
I could let this program work on one file at a time, but I'd rather loop within main() over any and all files given as arguments than write a loop in shell. I do want to make that hash of style names to transliteration methods available somehow globally, but only while processing each file individually, so closures again come to the rescue:
{
    my $style_methods;

    sub set_methods_for_styles   { $style_methods = shift }
    sub clear_methods_for_styles { undef $style_methods }
    sub get_method_for_style     { $style_methods->{ $_[0] } }
}

sub rewrite_xml
{
    my $contents = shift;
    my $xpath    = XML::XPath->new( xml => $contents );
    set_methods_for_styles( get_xml_style_methods( $xpath ) );

    my $pod = xml_to_pod( $xpath );
    clear_methods_for_styles();

    return $pod;
}
Hiding that global state behind an accessor with a writer and clearer means that only one place has to reset that global state between working on files. If I were to write this program with more robustness, I'd add an exception handler around the call to xml_to_pod() so that the clearer always gets called (unless something catastrophic happens, like a segfault or my server falling into a micro black hole), but the two types of problems I've had in this code are compilation-killing typos and missing transliteration methods. Both kill the program effectively enough that they need immediate fixing.
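For the curious, that more robust version might look something like the following sketch. It uses only core Perl (a block eval) rather than any exception module. The names convert_styles() and rewrite_guarded() are hypothetical: the first stands in for xml_to_pod() and merely looks up each style, dying on the first one with no mapping.

```perl
use strict;
use warnings;

{
    my $style_methods;

    sub set_methods_for_styles   { $style_methods = shift }
    sub clear_methods_for_styles { undef $style_methods }
    sub get_method_for_style     { $style_methods->{ $_[0] } }
}

# Hypothetical stand-in for xml_to_pod(): dies on an unmapped style.
sub convert_styles
{
    my @pod;

    for my $style (@_)
    {
        my $method = get_method_for_style( $style )
            or die "No transliteration method for style '$style'\n";
        push @pod, $method;
    }

    return join ' ', @pod;
}

sub rewrite_guarded
{
    my ($methods, @styles) = @_;
    set_methods_for_styles( $methods );

    my $pod   = eval { convert_styles( @styles ) };
    my $error = $@;

    clear_methods_for_styles();        # runs whether or not the eval died
    die $error unless defined $pod;    # rethrow only after cleaning up

    return $pod;
}

my $ok = rewrite_guarded( { P1 => 'toPodForHead0' }, 'P1' );
print "converted: $ok\n";

eval { rewrite_guarded( { P1 => 'toPodForHead0' }, 'T42' ) };
my $failure = $@;
print "failed: $failure";
print "state cleared\n" unless defined get_method_for_style( 'P1' );
```

The second call fails on the unmapped style T42, yet the lookup afterward finds nothing for P1, which shows that the clearer ran before the error propagated.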
All in all, this program was a lot easier to write than I thought it would be. I'd written a regex-based converter from Google Docs HTML output to PseudoPod before and it was awful. This approach was a lot easier. Much credit goes to ODT for being relatively sane in its XML, and much credit also goes to the ClubCompy documentation for having a regular (if informal) template. Perl deserves a lot of credit for great XML processing modules, open classes, dynamic dispatch, and other higher-level programming techniques that let clever people make little messes to make difficult problems tractable.
If you want to solve this problem with less monkeypatching, Robin Smidsrød's XML::Rabbit looks very effective. The essential messiness of the problem doesn't go away, but creating an effective object model from an XML document with XPath will get you halfway to the right solution.