One of my projects performs a lot of web scraping. Once every n units of time (where n can be days or weeks), a batch process fetches several web pages and extracts information from them. It's a well-solved problem.
I designed this system around the idea of a pipeline of related processes, where each component is as independent and idempotent as possible. This has positives and negatives; it's an abstraction like any other.
I initially wrote "fetch remote web page" and "analyze data from that page" as a single step, because I thought of "analyze" as the main goal and "fetch" as a dependent task. I separated them a couple of weeks ago to simplify the system: analysis now expects the data to be there, while fetching can run in parallel on a single machine or across multiple machines. (Testing the analysis step is also much easier, because feeding in dummy data is now trivial.)
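A compressed sketch of what that separation looks like in practice; the stage names, cache layout, and helpers here are illustrative, not the project's real code:

use strict;
use warnings;
use File::Slurp qw( read_file write_file );
use LWP::Simple qw( get );

# Illustrative only. Each stage works entirely off the filesystem, so either
# can be re-run independently, and the analysis stage can be tested by
# dropping dummy files into cache/.
sub fetch_stage
{
    my %page_urls = @_;    # cache key => URL

    while (my ($key, $url) = each %page_urls)
    {
        next if -e "cache/$key.html";                  # idempotent: skip pages already fetched
        write_file( "cache/$key.html", get( $url ) );
    }
}

sub analyze_stage
{
    my @keys = @_;

    for my $key (@keys)
    {
        my $html = read_file( "cache/$key.html" );     # assumes the fetch stage has already run
        # ... extract whatever the analysis needs from $html ...
    }
}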
I use the filesystem as a cache for these fetched files. That's easy to manage. I modified the role I use to grab data for the analysis stage to look in the cache first, then fall back to a network request. That was easy too. The get_formatted_data_for_analysis() method looked something like:
sub get_formatted_data_for_analysis
{
    my ($self, $type, $key) = @_;

    my $cached_path = $self->get_cached_path( $type, $key );

    # serve from the filesystem cache when possible
    if (-e $cached_path)
    {
        my $text = read_file( $cached_path );
        return $self->formatter->format_string( $text ) if $text;
    }

    # otherwise fall back to a network fetch
    return $self->formatter->format_string( $self->fetch_by_url( $type, $key ) );
}
I thought I was done. This trivial caching layer took five minutes to write and gave my project a lot of flexibility.
I thought this would speed up the processing stage, because I was able to make the fetching stage embarrassingly parallel so that more than one fetch could block on network IO simultaneously. My rough benchmark didn't show any speed improvement, but it was fast enough, so I moved on.
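Making the fetch stage embarrassingly parallel can be done with something like Parallel::ForkManager; this is a rough sketch under the assumption that each fetch writes its result straight into the filesystem cache, with a placeholder URL list and cache layout rather than the project's real code:

use strict;
use warnings;
use Parallel::ForkManager;
use LWP::Simple qw( getstore );

# Rough sketch only: URLs come from somewhere upstream, cache/ is a placeholder.
my @urls = @ARGV;
my $pm   = Parallel::ForkManager->new( 8 );  # up to eight fetches block on network IO at once

for my $url (@urls)
{
    $pm->start and next;                     # parent moves straight on to the next URL

    my ($name) = $url =~ m{([^/]+)\z};       # crude cache key derived from the URL
    getstore( $url, "cache/$name" );         # write the fetched page into the filesystem cache

    $pm->finish;                             # child exits once its fetch completes
}

$pm->wait_all_children;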
On Friday I decided to profile the slowest stage of the application with Devel::NYTProf. The slowest stage was the processing stage. I isolated it so that it performed no network fetching. It was still slow.
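For reference, profiling one stage in isolation needs nothing beyond loading the profiler at startup and rendering the report afterward (the script name here is a placeholder):

perl -d:NYTProf run_processing_stage.pl
nytprofhtml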
One of the formatter modules used to extract data from web pages is HTML::FormatText::Lynx. It allows me to run lynx --dump to strip out all of the HTML and other formatting of a document. The formatter allows you to pass in the name of a file or the contents of a file as a string.
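Both call styles look something like this; HTML::FormatText::Lynx (via HTML::FormatExternal) exposes them as class methods, and the file name here is a placeholder:

use HTML::FormatText::Lynx;

# Hand the formatter a file on disk...
my $from_file = HTML::FormatText::Lynx->format_file( 'page.html' );

# ...or hand it the page contents already in memory as a string.
my $html        = '<html><body><p>Hello, web.</p></body></html>';
my $from_string = HTML::FormatText::Lynx->format_string( $html );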
For some reason, most of the time in the processing stage in the profile was spent in file IO. That wasn't too surprising; these aren't all small files and there may be thousands of them. I dug deeper.
Most of the time in the processing stage was spent reading files: reading them in my method and reading them again in the formatter, even though I was passing the contents of those files to the formatter as strings.
I poked around at a few other things, but came back to the source code of the formatter. A comment in HTML::FormatExternal says:

format_string() takes the easy approach of putting the string in a temp file and letting format_file() do the real work. The formatter programs can generally read stdin and write stdout, so could do that with select() to simultaneously write and read back.
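That is, the comment describes behavior roughly like this sketch (my paraphrase, not the module's actual code):

use File::Temp qw( tempfile );

# Paraphrase of the documented shortcut, not HTML::FormatExternal's real source.
sub format_string
{
    my ($class, $html, @options) = @_;

    # write the string we already had in memory back out to a temporary file...
    my ($fh, $tempfile) = tempfile();
    print {$fh} $html;
    close $fh;

    # ...then let format_file() read it all over again and run the external program
    return $class->format_file( $tempfile, @options );
}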
In other words, all of the work I was doing to read in files was busy work, duplicating what the formatter was about to do anyway. (Okay, I stared at the code for a couple of minutes, thinking about various approaches to rewriting it and submitting a patch, or monkey patching it. Then I turned lazier and wiser.) I rewrote my code:
sub get_formatted_data_for_analysis
{
    my ($self, $type, $key) = @_;

    my $cached_path = $self->get_cached_path( $type, $key );

    # let the formatter read the cached file itself
    return $self->formatter->format_file( $cached_path ) if -e $cached_path;

    # otherwise fetch over the network and format the string
    return $self->formatter->format_string( $self->fetch_by_url( $type, $key ) );
}
The result was a 25% performance improvement.
Three things jumped out at me in this process. First, how nice it is to have a working tool like NYTProf and a community that distributes source code, so that I could examine the whole stack of my application to isolate performance problems. Second, how interesting that an assumption and an admitted shortcut in a dependency could have such an effect on my own code. Third, how much more I like my new code with all of the file handling gone; pushing that responsibility elsewhere is a nice simplification even without the performance improvement.
Perhaps the two tools I miss most from my C programming days are Valgrind/Callgrind and KCachegrind, but NYTProf goes a long way toward filling that gap. Besides, I'm at least 20 times more productive with a language like Perl.