When I wrote the code that became Plack::Middleware::BotDetector, I planned to use it only for filtering non-humans out of our cohort analysis system. (You can read the entire rationale and explanation in Detecting Bots and Spiders with Plack Middleware.)
Since I wrote that article, I extracted that middleware from our project and released it on its own as Plack::Middleware::BotDetector. As is often the case, solving one problem suggests the possibility of solving multiple problems.
When I build systems that analyze data, I try to make it possible for the analysis to improve over time. Anomalous cases should be obvious and easy to correct and, when corrected, should no longer be obvious (because they're no longer anomalous). When detecting non-human user agents, we analyze our access logs for likely candidates to add to the list used to construct the regex passed to Plack::Middleware::BotDetector.
I had written a small Perl program to analyze our logs and give a histogram of user agents, but I still ended up eyeballing that list to see if any new bot user agents had appeared. (You can tell your SEO strategy is working when you get more bot traffic.)
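That analysis program needs only a few lines. Here's a minimal sketch (not our actual program; it assumes Apache's combined log format, which Plack::Middleware::AccessLog emits by default, where the user agent is the final quoted field):

```perl
use strict;
use warnings;

# tally the User-Agent field from access log lines; assumes the
# combined log format, where the user agent is the last quoted field
sub count_user_agents
{
    my %count;

    for my $line (@_)
    {
        my ($agent) = $line =~ /"([^"]*)"\s*$/;
        $count{$agent}++ if defined $agent;
    }

    return \%count;
}

# print a histogram, most frequent agents first
sub print_histogram
{
    my $count = shift;

    printf "%6d %s\n", $count->{$_}, $_
        for sort { $count->{$b} <=> $count->{$a} } keys %$count;
}
```

Piping the access log through something like this beats scrolling through raw logs, but it still leaves the final review to a human.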
Anytime you find yourself reviewing data by hand, ask yourself if a computer can do it.
We use Plack, obviously. Plack::Runner enables default middleware, including Plack::Middleware::AccessLog. That's responsible for writing an access log (you can configure the location), and we used that because it was easy and available.
"Wait," I asked myself. "Why am I reviewing this log information when I have to remember to exclude most of it because I already know it's bot traffic?" More important, our system already knows it's bot traffic, because the BotDetector middleware is already excluding those requests from our cohort analysis event logging.
What if I used the BotDetector to decide whether to log a request's information? (We don't do anything with these access logs which requires us to keep data about bot traffic.) That way, every update to the BotDetector regex would exclude more and more bot traffic, and the only things we'd see in our daily reports would be real users and bots we needed to exclude.
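The key is that BotDetector records its verdict in the PSGI environment, so anything downstream can check it. In essence it works like this simplified sketch (not the module's actual code; `make_bot_flagger` is a name invented here, and the real module uses the Plack middleware API rather than a bare closure):

```perl
use strict;
use warnings;

# wrap a PSGI app so every request carries a bot-or-not flag in $env
sub make_bot_flagger
{
    my ($app, $bot_regex) = @_;

    return sub
    {
        my $env = shift;
        my $ua  = $env->{HTTP_USER_AGENT} || '';

        # downstream middleware and apps can all branch on this key
        $env->{'BotDetector.looks-like-bot'} = 1 if $ua =~ $bot_regex;
        return $app->( $env );
    };
}
```

Every downstream component, whether cohort event logging or access logging, can then branch on `$env->{'BotDetector.looks-like-bot'}` instead of re-running the regex itself.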
I wrote a custom piece of middleware in about two minutes:
package MyApp::Plack::Middleware::AccessLogNoBots;
# ABSTRACT: Plack middleware which only logs non-bot requests

use Modern::Perl;
use parent 'Plack::Middleware::AccessLog';

sub call
{
    my $self = shift;
    my $env  = $_[0];

    return $env->{'BotDetector.looks-like-bot'}
         ? $self->app->( $env )
         : $self->SUPER::call( @_ );
}

1;
This class extends the AccessLog middleware class to override the call() method. If the request looks like it came from a spider, it passes the request through to the next middleware. Otherwise, it lets the parent class log the request.
Installing this in our .psgi file was more difficult than writing the class, which says more about how easy it was to write this class than anything else. The only complicating factor is that Plack::Runner takes the responsibility for setting up its AccessLog component. I ended up with something like:
use MyApp;
use MyApp::BotDetector;
use Plack::Builder;
use Plack::App::File;

my $app = builder
{
    enable 'Plack::Middleware::BotDetector',
        bot_regex => MyApp::BotDetector::bot_regex();
    enable 'Plack::Middleware::ConditionalGET';
    enable 'Plack::Middleware::ETag', file_etag => [qw/inode mtime size/];
    enable 'Plack::Middleware::ContentLength';

    if ($ENV{MA_ACCESS_LOG})
    {
        open my $logfh, '>>', $ENV{MA_ACCESS_LOG}
            or die "Cannot append to '$ENV{MA_ACCESS_LOG}': $!";
        $logfh->autoflush( 1 );

        enable '+MyApp::Plack::Middleware::AccessLogNoBots',
            logger => sub { $logfh->print( @_ ) };
    }

    MyApp->apply_default_middlewares(MyApp->psgi_app);
};
... where the presence of the environment variable governs the location of the access log file. I also changed the scripts we use to launch this .psgi file to pass the --no-default-middleware flag to Plack::Runner.
The results have been wonderful (except that our site looked a lot busier before, when the logs showed Baidu spidering the whole thing at least twice a day). The decorator pattern of Plack continues to demonstrate its value, and the cleanliness of extension and ease of writing this code argues yet again for putting conditionals (log or don't log) where they belong.
All I could ask for is a little more customizability for Plack::Runner to make some of the code in my .psgi file go away, but I'm probably at the point where it makes sense to avoid plackup and write my own program which calls Plack::Runner directly.
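Such a launcher needn't be long. A sketch following Plack::Runner's documented interface (the guard on run() is my addition, so merely loading the file starts nothing):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Plack::Runner;

# like plackup, but with our flags baked in so no one forgets them
my $runner = Plack::Runner->new;
$runner->parse_options( '--no-default-middleware', @ARGV );

# run() loads the .psgi file named on the command line and starts
# the server; only do so when a path (or other arguments) arrived
$runner->run if @ARGV;
```

This keeps the --no-default-middleware flag out of every launch script and in one place.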
Update: Miyagawa pointed out that Plack::Middleware::Conditional offers an alternate way to accomplish the same thing without writing custom middleware:
my $app = builder
{
    enable 'Plack::Middleware::BotDetector',
        bot_regex => MyApp::BotDetector::bot_regex();
    enable 'Plack::Middleware::ConditionalGET';
    enable 'Plack::Middleware::ETag', file_etag => [qw/inode mtime size/];
    enable 'Plack::Middleware::ContentLength';
    enable_if { ! $_[0]->{'BotDetector.looks-like-bot'} } 'AccessLog';

    MyApp->apply_default_middlewares(MyApp->psgi_app);
};
We didn't use this technique because of the way we wanted to handle the log file, but that's what the Conditional middleware is for.