When I have to convert data between formats, I reach for Perl. While many
people think Perl's built-in regular expressions make data munging easy, my
experience is that Perl's multi-paradigm nature and dynamic programming
flexibility are more important.
The Problem
I help run ClubCompy, a retro-inspired,
zero-installation, browser-based programming environment designed to help
children learn about computing. One of the reasons they recruited me is to
design the educational components, including documentation. (I also know a few
things about compilers and business.)
While ClubCompy has a surprising amount of power in its underlying virtual
machine, that power is currently exposed in a programming language called
Tasty—a
mixture of 8-bit BASIC and Logo.
As with most systems which evolve from a simple idea into something else,
following the law of opportunism, the project's structure, organization, and
tooling have accreted organically instead of following a rigid design. (Startup
hackers: your job is to prune things when necessary until you discover the core
of your business.) In particular, the documentation for the Tasty language
exists in a series of OpenOffice files,
one per language keyword.
The good news is that documentation exists. It's mostly complete, too: every
keyword has documentation, and most of it is comprehensive. (Maybe 15 or 20%
needs expansion, but we'll get there.)
The bad news is that the documentation exists in .odt files.
They're not binary blobs, but they don't fit with our publishing system:
they're too difficult to convert to clean PDF or very clean HTML for use
throughout the system. They're also a mess when checked into source
control.
Monday I decided to convert them to POD. (ClubCompy uses the Onyx Neon publishing toolchain designed for
things like Modern
Perl: the book. Everything not yet available on the CPAN is available from
my GitHub account.)
Inside ODT Files
An OpenOffice .odt file is a zipped archive of several other files.
Fortunately, there's only one file I care about and very fortunately, it's a
reasonably self-contained XML file. Getting the contents of
content.xml is easy with a little bit of Archive::Zip code:
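use Archive::Zip;
sub get_xml_contents
{
    my $file = shift;
    # new() returns undef if the archive cannot be read
    my $zip = Archive::Zip->new( $file )
        or die "Cannot read '$file' as a zip archive\n";
    # memberNamed() returns undef if the archive has no such member
    my $content = $zip->memberNamed( 'content.xml' )
        or die "No content.xml in '$file'\n";
    return $content->contents;
}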
All of the Tasty keywords follow a standard template for documentation. This
is both good and bad. It's good that discovering how OpenOffice represents
each unique element in XML is relatively easy: figure it out once and that
representation should apply to all files. It's bad that the documentation
template didn't use custom semantic styles, like "Top-level Header" and
"Program Code".
That means all of the styles are ad hoc:
<office:automatic-styles>
<style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
<style:paragraph-properties fo:background-color="#666699">
<style:background-image />
</style:paragraph-properties>
<style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="14pt" fo:font-weight="bold" style:font-size-asian="14pt" style:font-weight-asian="bold" style:font-size-complex="14pt" style:font-weight-complex="bold" />
</style:style>
<style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard" style:master-page-name="">
<style:paragraph-properties fo:margin-left="0.2602in" fo:margin-right="0in" fo:text-indent="0in" style:auto-text-indent="false" style:page-number="auto" fo:background-color="#9999cc">
<style:background-image />
</style:paragraph-properties>
<style:text-properties fo:color="#ffffff" style:font-name="Calibri" fo:font-size="12pt" fo:font-weight="bold" style:font-size-asian="12pt" style:font-weight-asian="bold" style:font-size-complex="12pt" style:font-weight-complex="bold" />
</style:style>
...
</office:automatic-styles>
I'll explain that more later.
The actual text of each file resembles:
<office:body>
<office:text>
<text:sequence-decls>
<text:sequence-decl text:display-outline-level="0" text:name="Illustration" />
<text:sequence-decl text:display-outline-level="0" text:name="Table" />
<text:sequence-decl text:display-outline-level="0" text:name="Text" />
<text:sequence-decl text:display-outline-level="0" text:name="Drawing" />
</text:sequence-decls>
<text:p text:style-name="P1">Keyword</text:p>
<text:p text:style-name="P9">WHILE<text:span text:style-name="T1">-DO</text:span>
</text:p>
<text:p text:style-name="P8">
<text:span text:style-name="T2">END</text:span>
</text:p>
...
</office:text>
</office:body>
All of the text of the documentation is available under <text:p> tags.
Extracting Text
Extracting this text is a job for XPath. While I could get more specific
with the XPath expression (find all direct children of <office:text>), I
went for the simple solution at first:
use XML::XPath;
use XML::XPath::XMLParser;

sub rewrite_xml
{
    my $contents = shift;
    my $xpath = XML::XPath->new( xml => $contents );
    set_methods_for_styles( get_xml_style_methods( $xpath ) );
    my $pod = xml_to_pod( $xpath );
    clear_methods_for_styles();
    return $pod;
}

sub xml_to_pod
{
    my $xpath = shift;
    my $nodeset = $xpath->find( '//text:p' );
    # start with an empty string so a document with no paragraphs
    # returns '' instead of undef
    my $pod = '';
    for my $node ($nodeset->get_nodelist)
    {
        my $style = $node->getAttribute( 'text:style-name' );
        # a paragraph with no children has nothing to emit
        $style = 'Empty' if @{ $node->getChildNodes } == 0;
        my $method = get_method_for_style( $style );
        $pod .= $node->$method;
    }
    return $pod;
}
Ignore the get_method_for_style() calls for now. The important part of
xml_to_pod is that it finds these tags in the XML and performs an action on
each of them.
What's that action? Transforming it to POD, of course.
Look in the sample XML again. Each of the paragraphs has a text:style-name
attribute. That attribute names one of the styles declared earlier in that
file. Given the name of a style, the body of the loop finds the name of a
method and calls that method to transliterate the contents of that tag to
POD.
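The lookup itself isn't shown in this post. As a rough sketch only, assuming the mapping lives in a plain hash keyed by style name, it might resemble the following. The hash contents, the particular style-to-method pairs, and the fallback to toPodForPlain are all illustrative assumptions, not the real implementation:

```perl
use strict;
use warnings;

# Hypothetical mapping from OpenOffice automatic style names to the
# names of POD-emitting methods. The pairs here are made up for
# illustration; the real table would come from each file's styles.
my %method_for_style = (
    P1    => 'toPodForHead1',
    T2    => 'toPodForCode',
    Empty => 'toPodForEmpty',
);

# Return the name of the method to call for a style. Falling back to
# plain text for unknown or missing styles is an assumed default.
sub get_method_for_style
{
    my ($style) = @_;
    $style //= '';
    return $method_for_style{$style} // 'toPodForPlain';
}

print get_method_for_style( 'P1' ), "\n";
print get_method_for_style( 'P42' ), "\n";
```

The function returns only a string. The caller decides which object to invoke that method on, which keeps the lookup independent of the XML classes.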
Transliterating to POD
Here's where the power of Perl really shines. Every node in that nodeset is
an instance of XML::XPath::Node::Element.
That class knows nothing about POD. At least, it knows nothing about POD until
I declared some methods in it:
package XML::XPath::Node::Element;

sub kidsToPod { join '', map { $_->toPod } shift->getChildNodes }

sub toPod
{
    my $self = shift;
    my ($name) = $self->getName =~ /text:(\w+)/;
    my $method = 'toPodFor' . ucfirst $name;
    return $self->$method;
}

sub toPodForEmpty { '' }
sub toPodForS { ' ' }
sub toPodForTab { ' ' }

sub toPodForSpan
{
    my $self = shift;
    my $style = $self->getAttribute( 'text:style-name' ) // '';
    $style = 'Empty' if @{ $self->getChildNodes } == 0;
    my $method = main::get_method_for_style( $style );
    return $self->$method;
}

sub toPodForBold { 'B<' . shift->kidsToPod . '>' }
sub toPodForCode { 'C<' . shift->kidsToPod . '>' }
sub toPodForCodePara { ' ' . shift->kidsToPod . "\n" }
sub toPodForItalic { 'I<' . shift->kidsToPod . '>' }
sub toPodForPlain { shift->wrapKids( '', '' ) }
sub toPodForPlainPara { shift->wrapKids( '', '' ) . "\n\n" }
sub toPodForBoldCode { 'C<B<' . shift->kidsToPod . '>>' }
sub toPodForBoldCodePara { 'C<B<' . shift->kidsToPod . ">>\n" }
sub toPodForHead0 { shift->wrapKids( '=head0 ', "\n\n" ) }
sub toPodForHead1 { shift->wrapKids( '=head1 ', "\n\n" ) }
sub toPodForHead2 { shift->wrapKids( '=head2 ', "\n\n" ) }

sub wrapKids
{
    my ($self, $pre, $post) = @_;
    my $kid_text = $self->kidsToPod;
    return '' unless $kid_text;
    return $pre . $kid_text . $post;
}
Because Perl has open classes, you can add methods to classes (or redefine
methods) any time you want. Because Perl has dynamic method dispatch, you can
use a string as the name of a method to call.
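Here is a minimal, self-contained sketch of those two features working together. The Some::Node class is a made-up stand-in for a class from someone else's library, not part of XML::XPath:

```perl
use strict;
use warnings;

# A stand-in for a class that comes from another library.
package Some::Node;

sub new  { my ($class, %args) = @_; return bless { %args }, $class }
sub text { $_[0]{text} }

# Later, in your own code, reopen the same package and add methods.
package Some::Node;

sub toPodForBold   { 'B<' . $_[0]->text . '>' }
sub toPodForItalic { 'I<' . $_[0]->text . '>' }

package main;

my $node = Some::Node->new( text => 'WHILE' );

# Build each method name as a string, then call it as a method.
for my $style (qw( Bold Italic ))
{
    my $method = 'toPodFor' . $style;
    print $node->$method, "\n";
}
```

This prints B<WHILE> and then I<WHILE>. Calling a method through a string in a scalar is allowed even under strict, which is what makes the style-to-method lookup practical.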
You can see that this code gets a little bit messy here. That's part and
parcel of the tree transformation technique central to compilers; the real
world is messy, and that mess has to go somewhere.
The wrapKids() method handles the case where one of these nodes has no
textual content but does have a specific style. Given a snippet of
documentation like:
Example 1:
10 x = 0
20 WHILE x LT 26 DO
30 PRINT TOCHAR x + 65
40 x = x + 1
50 END
RUN
(prints ABCDEFGHIJKLMNOPQRSTUVWXYZ)
... the blank line between RUN and the output is a unique paragraph with the
monospace font applied. A naïve output from one of these methods might
produce the POD C<> for that line. wrapKids() prevents that.
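To make the difference concrete, here is a small sketch that works on plain strings rather than XML nodes. The functions naive_code() and wrap_text() are hypothetical names for illustration; wrap_text() follows the same guard logic as wrapKids():

```perl
use strict;
use warnings;

# Naive wrapper: always emits the formatting code, even around nothing.
sub naive_code
{
    my $kid_text = shift;
    return 'C<' . $kid_text . '>';
}

# Guarded wrapper: emit nothing at all when there is no text to wrap.
sub wrap_text
{
    my ($kid_text, $pre, $post) = @_;
    return '' unless $kid_text;
    return $pre . $kid_text . $post;
}

# Compare both wrappers on a line with text and on a blank line.
for my $line ( 'RUN', '' )
{
    printf "naive: [%s] guarded: [%s]\n",
        naive_code( $line ), wrap_text( $line, 'C<', '>' );
}
```

For the blank line, the naive version yields C<> while the guarded version yields an empty string. One caveat of a plain truthiness test is that it also discards text consisting solely of the string 0; a length check would avoid that if it ever matters.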
This open class approach works very well. It scales well too in
terms of complexity. Even if this code eventually migrates to build a POD
document model (see Pod::PseudoPod::DOM),
giving individual nodes the responsibility of emitting a tree or text moves the
custom behavior to where it most belongs.
(The benefit of a DOM is that basic tree transformation rules can take care
of pruning out unnecessary elements, such as the blank code line.)
The Little Details
XML::XPath::Node::Element objects may nest, but you can see how that nesting
works just fine through the toPod() method. Those elements may themselves
also contain XML::XPath::Node::Text instances as children. These objects
represent plain text.
So far, I've only found one situation where this plain text needs any
manipulation. Adding one method fixes this:
package XML::XPath::Node::Text;

use HTML::Entities ();

sub toPod
{
    # decode any XML entities before escaping the text for POD
    my $raw_text = HTML::Entities::decode_entities( shift->toString );
    return main::encode_pod( $raw_text );
}
The encode_pod() function (it's in main so as not to make it available as a
method inadvertently) is:
use feature 'state'; # state variables need Perl 5.10 or newer
use Regexp::Assemble;

my %escapes =
(
    '<' => 'E<lt>',
    '>' => 'E<gt>',
);

sub encode_pod
{
    # build the pattern once, on the first call
    state $replace = make_regexp( \%escapes );
    my $text = shift;
    $text =~ s/($replace)/$escapes{$1}/g;
    return $text;
}

sub make_regexp
{
    my $escapes = shift;
    my $ra = Regexp::Assemble->new;
    $ra->add( $_ ) for keys %$escapes;
    return $ra->re;
}
More robust solutions exist, but so far this is all I've needed.
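For a sense of what the escaping does, here is a reduced sketch that swaps the Regexp::Assemble pattern for a hand-written character class so that it runs with core Perl alone. The name encode_pod_simple() is hypothetical, and this is a stand-in for illustration, not the function above:

```perl
use strict;
use warnings;

my %escapes = ( '<' => 'E<lt>', '>' => 'E<gt>' );

# Same idea as encode_pod(), with a literal character class standing
# in for the pattern Regexp::Assemble would build from the hash keys.
sub encode_pod_simple
{
    my $text = shift;
    $text =~ s/([<>])/$escapes{$1}/g;
    return $text;
}

print encode_pod_simple( 'a < b and b > c' ), "\n";
```

This prints a E<lt> b and b E<gt> c. Because each character is replaced in a single pass, the angle brackets inside the generated E<lt> and E<gt> codes are not escaped a second time.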
I do admit that the implementation is a little messy in places. That's one
of the problems with this compiler technique: sometimes you have data that
needs to be available everywhere but you don't want to pass it as arguments
everywhere and you don't want to wrap up everything in intermediary objects
because you're already using perfectly good objects from elsewhere.
I haven't shown the code which identifies styles and makes the hash of style
name to output method yet; that's for the next post. I'm sure you can start to
figure out how it works already.