If you analyze the requests made to your web site, you'll have to deal with enormous numbers of bots, spiders, and other automated requests for your resources which don't represent measurable users. As promised in Annotating User Events for Cohort Analysis, here's how I handle them.
I wrote a tiny piece of Plack middleware which I enabled in the .psgi file which bundles my application:
package MyApp::Plack::Middleware::BotDetector;
# ABSTRACT: Plack middleware to identify bots and spiders

use Modern::Perl;
use Plack::Request;
use Regexp::Assemble;

use parent 'Plack::Middleware';

my $bot_regex = make_bot_regex();

sub call
{
    my ($self, $env) = @_;
    my $req          = Plack::Request->new( $env );
    my $user_agent   = $req->user_agent;

    if ($user_agent)
    {
        $env->{'BotDetector.looks-like-bot'} = 1 if $user_agent =~ qr/$bot_regex/;
    }

    return $self->app->( $env );
}

sub make_bot_regex
{
    my $ra = Regexp::Assemble->new;

    while (<DATA>)
    {
        chomp;
        $ra->add( '\b' . quotemeta( $_ ) . '\b' );
    }

    return $ra->re;
}

1;
__DATA__
Baiduspider
Googlebot
YandexBot
AdsBot-Google
AdsBot-Google-Mobile
bingbot
facebookexternalhit
libwww-perl
aiHitBot
Baiduspider+
aiHitBot-BP
NetcraftSurveyAgent
Google-Site-Verification
W3C_Validator
ia_archiver
Nessus
UnwindFetchor
Butterfly
Netcraft Web Server Survey
Twitterbot
PaperLiBot
Add Catalog
1PasswordThumbs
MJ12bot
SmartLinksAddon
YahooCacheSystem
TweetmemeBot
CJNetworkQuality
YandexImages
StatusNet
Untiny
Feedfetcher-Google
DCPbot
AppEngine-Google
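For reference, enabling middleware like this in the bundling .psgi file usually goes through Plack::Builder. This is an illustrative sketch, not the original file: the application class name and the way it produces its PSGI coderef are my assumptions.

```perl
# app.psgi -- illustrative sketch; MyApp and psgi_app() are assumptions
use Plack::Builder;
use MyApp;

my $app = MyApp->psgi_app;

builder
{
    # enable() wraps the application in the named middleware; the leading
    # '+' tells Plack to use the class name verbatim rather than prefixing
    # it with Plack::Middleware::
    enable '+MyApp::Plack::Middleware::BotDetector';
    $app;
};
```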
Plack middleware wraps around the application to examine and possibly modify the incoming request, to call the application (or the next piece of middleware), and to examine and possibly modify the outgoing response. Plack conforms to the PSGI specification to make this possible.
Update: This middleware is now available as Plack::Middleware::BotDetector from the CPAN. Thanks to Big Blue Marble and Trendshare for sponsoring its development and release.
All of that means that any piece of middleware gets activated by something which calls its call() method, passing in the incoming request as the first parameter. This request is a hash with specified keys. The application, or at least the next piece of middleware to call, is available from the object's accessor method app().
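If the Plack::Middleware class feels magical, remember that a PSGI application is just a code reference taking $env and returning a response triple, and middleware is just a function that wraps one code reference in another. Here's a dependency-free sketch of the same pattern; the function names and the single hard-coded bot pattern are mine, not from the post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A PSGI application: takes the environment hash, returns a response triple.
my $app = sub
{
    my $env  = shift;
    my $body = $env->{'BotDetector.looks-like-bot'} ? 'bot' : 'human';
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ $body ] ];
};

# Middleware: takes an app coderef and returns a new app which examines
# the request, annotates $env, then delegates to the wrapped app.
sub wrap_with_bot_detector
{
    my $inner = shift;

    return sub
    {
        my $env = shift;
        my $ua  = $env->{HTTP_USER_AGENT} || '';
        $env->{'BotDetector.looks-like-bot'} = 1 if $ua =~ /Googlebot/;
        return $inner->( $env );
    };
}

my $wrapped  = wrap_with_bot_detector( $app );
my $response = $wrapped->( { HTTP_USER_AGENT => 'Googlebot/2.1' } );
print $response->[2][0], "\n";    # prints "bot"
```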
(I'm lazy. I use Plack::Request to turn $env into an object. This is not necessary.)
The rest of the code is really simple. I have a list of unique segments of the user agent strings I've seen in this application. I use Regexp::Assemble to turn these words into a single (efficient) regex. If the incoming request's user agent string matches anything in the regex, I add a new entry to the environment hash.
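Regexp::Assemble isn't in the Perl core, but the idea is easy to demonstrate with a plain alternation. Regexp::Assemble builds a trie-optimized regex from the same input, but it matches the same strings as this naive version (the three-entry word list is a small assumed sample):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few user agent fragments, as in the middleware's __DATA__ section.
my @bots = qw( Googlebot bingbot YandexBot );

# Join the quoted fragments into one alternation with word boundaries;
# Regexp::Assemble produces a more efficient equivalent of this regex.
my $alternation = join '|', map { '\b' . quotemeta( $_ ) . '\b' } @bots;
my $bot_regex   = qr/$alternation/;

for my $ua ('Mozilla/5.0 (compatible; Googlebot/2.1)', 'Mozilla/5.0 (X11; Linux)')
{
    printf "%s => %s\n", $ua, ( $ua =~ $bot_regex ? 'bot' : 'not a bot' );
}
```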
With that in place, any other piece of middleware executed after this point in the request—or the application itself—can examine the environment and choose different behavior based on the bot-looking-ness of any request. My cohort event logger method looks like:
=head2 log_cohort_event

Logs a cohort event. At the end of the request, these get cleared.

=cut

sub log_cohort_event
{
    my ($self, %event) = @_;
    return if $self->request->env->{'BotDetector.looks-like-bot'};

    $event{usertoken} ||= $self->sessionid || 'unknownuser';
    push @{ $self->cohort_events }, \%event;
}
The bolded line is all it took in my application to stop logging cohort events for spiders. If and when I see a new spider in the logs, I can exclude it by adding a line to the middleware's DATA section and restarting the server.
(You might rather store this information in a database, but I'd rather build the regex once than loop through a database with a LIKE query. I haven't found an ideal alternate solution, which is why I haven't put this on the CPAN. Perhaps this is two modules, one for the middleware and one which exports a regex to identify spider user agents.)
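One way that two-module split could look: a small module that builds the regex once at load time and exposes it, which both the middleware and any offline log analyzer could share. The module name, variable name, and word list below are all hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: a module whose only job is to expose a bot regex.
package Bot::UserAgent::Regex;

my @FRAGMENTS = qw( Googlebot bingbot YandexBot Baiduspider );

# Built once at load time -- no per-request assembly, no database lookups.
our $BOT_REGEX = do
{
    my $alt = join '|', map { '\b' . quotemeta( $_ ) . '\b' } @FRAGMENTS;
    qr/$alt/;
};

package main;

# The middleware (or a log-analysis script) just matches against it.
print 'Baiduspider/2.0' =~ $Bot::UserAgent::Regex::BOT_REGEX
    ? "bot\n"
    : "not a bot\n";    # prints "bot"
```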
There's one more trick to this cohort event logging: traceability. That's the topic for next time.
My takeaway is that there is probably a need for an "Acme::RobotsRegex" module on CPAN? Perhaps better named.
I am also sitting here thinking about how interesting it would be to do monitoring functions via Plack plugins, like latency monitoring or perhaps cross-site scripting detection. Functions like those in "Introscope", but without involving CA.
There should be a space between "package" and "MyApp" on line 1 of the package source. :)
Nice use of Regexp::Assemble and a __DATA__ section! Although I have yet to play with Plack, I like this idea of using middleware for some intermediary logic.
I like the idea of a middleware module for this. But wouldn't it be better to use (and perhaps update) one of the existing modules for identifying bots?
Neil Bowers wrote this nice article: http://neilb.org/reviews/user-agent.html
I didn't do this for two reasons.
First, I didn't find these modules (or the review) on my initial search of the CPAN. Second, Neil's conclusions show that the tradeoff between accuracy and speed is awful for the cases he saw. What I have now is definitely not optimal from a code and reuse standpoint, but it doesn't slow down every request dramatically.
I tested your code against more than 10000 user agent strings (from http://useragentstring.com/pages/All/). On my laptop it took about a quarter of a second to filter out 80 bot strings:
Not bad at all, thank you!
This sounds way too complex and heavy.
We need a lightweight way to detect NEW bots from their behavior.
If it has to run on every page load, it would need to be EXTREMELY lightweight, i.e. no heavier than hitting one or two small files, no MySQL even, so as not to add significant latency to page loads, and so it can quickly detect and block some of those really nasty ones with insane request rates!
So the detection methods will need to cope with pretty hardcore request rates long enough to detect and block!
Michael, you could try including three hidden links somewhere on your homepage a la:
An IP address that hits more than one of those links is flagged as a suspected bot.
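A minimal sketch of the flagging logic the commenter describes, assuming you record which IP requested which honeypot URL somewhere. The threshold ("more than one distinct link"), the function names, and the in-memory hash are all my assumptions; a real deployment would need shared storage across workers:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Track which honeypot links each IP has requested.
my %hits;

sub record_honeypot_hit
{
    my ($ip, $link) = @_;
    $hits{$ip}{$link} = 1;
}

# An IP that hits more than one hidden link is a suspected bot.
sub looks_like_bot
{
    my $ip = shift;
    return keys %{ $hits{$ip} || {} } > 1;
}

record_honeypot_hit( '203.0.113.7',  '/trap-1' );
record_honeypot_hit( '203.0.113.7',  '/trap-2' );
record_honeypot_hit( '198.51.100.4', '/trap-1' );

print looks_like_bot('203.0.113.7')  ? "bot\n" : "ok\n";    # prints "bot"
print looks_like_bot('198.51.100.4') ? "bot\n" : "ok\n";    # prints "ok"
```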