Regular Expressions and Matching

Much of Perl's text processing power comes from its use of regular expressions. A regular expression (also regex or regexp) is a pattern which describes characteristics of a piece of text. A regular expression engine applies these patterns to match or to replace portions of text.

While mastering regular expressions is a daunting pursuit, a little knowledge will give you great power. Perl's core regex documentation includes a tutorial (perldoc perlretut), a reference guide (perldoc perlreref), and full documentation (perldoc perlre). Jeffrey Friedl's book Mastering Regular Expressions explains the theory and the mechanics of how regular expressions work.

Literals

A regex can be as simple as a substring pattern:

    my $name = 'Chatfield';
    say 'Found a hat!' if $name =~ /hat/;

The match operator (m//, abbreviated //) identifies a regular expression—in this example, hat. This pattern is not a word. Instead it means "the h character, followed by the a character, followed by the t character." Each character in the pattern is an indivisible element (an atom). An atom matches or it doesn't.

The regex binding operator (=~) is an infix operator (Fixity) which applies the regex of its second operand to a string as its first operand. When evaluated in scalar context, a match evaluates to a true value if it succeeds. The negated form of the binding operator (!~) evaluates to a true value unless the match succeeds.
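
As a minimal sketch of both binding operators (the variable names $has_hat and $lacks_cat are illustrative, not part of any API):

```perl
use strict;
use warnings;
use feature 'say';

my $name = 'Chatfield';

# =~ is true when the pattern matches somewhere in the string
my $has_hat   = ( $name =~ /hat/ ) ? 1 : 0;

# !~ is true when the pattern matches nowhere in the string
my $lacks_cat = ( $name !~ /cat/ ) ? 1 : 0;

say "has hat:   $has_hat";
say "lacks cat: $lacks_cat";
```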

Remember index!

The index builtin can also search for a literal substring within a string. Using a regex engine for that is like flying an autonomous combat drone to the corner store to buy cheese—but Perl lets you write code as it seems most clear to you.
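
For comparison, here is a small sketch of the same search using index. The variable names are illustrative only:

```perl
use strict;
use warnings;
use feature 'say';

my $name = 'Chatfield';

# index returns the zero-based position of the substring, or -1 if absent
my $hat_at = index( $name, 'hat' );
my $cat_at = index( $name, 'cat' );

say 'Found a hat!' if $hat_at >= 0;
say 'No cat here.' if $cat_at == -1;
```

Remember to compare the result against -1 (or test for >= 0), because a match at the very start of the string returns 0, which is false.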

The substitution operator, s///, is in one sense a circumfix operator (Fixity) with two operands. Its first operand (the part between the first and second delimiters) is a regular expression to match when used with the regex binding operator. The second operand (the part between the second and third delimiters) is a substring used to replace the matched portion of the string operand used with the regex binding operator. For example, to cure pesky summer allergies:

    my $status = 'I feel ill.';
    $status    =~ s/ill/well/;
    say $status;

The qr// Operator and Regex Combinations

The qr// operator creates first-class regexes. Use them as the operand of the match operator or the first operand of the substitution operator:

    my $hat = qr/hat/;
    say 'Found a hat!' if $name =~ /$hat/;

... or combine multiple regex objects into complex patterns:

    my $hat   = qr/hat/;
    my $field = qr/field/;

    say 'Found a hat in a field!'
        if $name =~ /$hat$field/;

    like( $name, qr/$hat$field/,
                   'Found a hat in a field!' );

Like is, with More like

Test::More's like function tests that the first argument matches the regex provided as the second argument.
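
The snippets in this chapter leave out the surrounding test file. One plausible way to run them as a standalone program looks like this; the test descriptions are made up for illustration:

```perl
use strict;
use warnings;
use Test::More;

my $name  = 'Chatfield';
my $hat   = qr/hat/;
my $field = qr/field/;

# like passes when the string matches the regex
like(   $name, qr/$hat$field/, 'hat is followed by field' );

# unlike passes when the string does not match
unlike( $name, qr/$field$hat/, 'field is not followed by hat' );

done_testing();
```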

Quantifiers

Regular expressions get more powerful through the use of regex quantifiers. These metacharacters govern how often a regex component may appear in a matching string. The simplest quantifier is the zero or one quantifier, or ?:

    my $cat_or_ct = qr/ca?t/;

    like( 'cat', $cat_or_ct, "'cat' matches /ca?t/" );
    like( 'ct',  $cat_or_ct, "'ct' matches /ca?t/"  );

Any atom in a regular expression followed by the ? character means "match zero or one of this atom." This regular expression matches if zero or one a characters immediately follow a c character and immediately precede a t character. This regex matches both the literal substrings cat and ct.

The one or more quantifier, or +, matches at least one of the quantified atom:

    my $some_a = qr/ca+t/;

    like( 'cat',    $some_a, "'cat' matches /ca+t/" );
    like( 'caat',   $some_a, "'caat' matches"       );
    like( 'caaat',  $some_a, "'caaat' matches"      );
    like( 'caaaat', $some_a, "'caaaat' matches"     );

    unlike( 'ct',   $some_a, "'ct' does not match"  );

There is no theoretical limit to the maximum number of quantified atoms which can match.

The zero or more quantifier, *, matches zero or more instances of the quantified atom:

    my $any_a = qr/ca*t/;

    like( 'cat',    $any_a, "'cat' matches /ca*t/" );
    like( 'caat',   $any_a, "'caat' matches"       );
    like( 'caaat',  $any_a, "'caaat' matches"      );
    like( 'caaaat', $any_a, "'caaaat' matches"     );
    like( 'ct',     $any_a, "'ct' matches"         );

As silly as this seems, it allows you to specify optional components of a regex. Use it sparingly, though: it's a blunt and expensive tool. Most regular expressions benefit from using the ? and + quantifiers far more than *. Precision of intent often improves clarity.

Numeric quantifiers express the number of times an atom may match. {n} means that a match must occur exactly n times.

    # equivalent to qr/cat/;
    my $only_one_a = qr/ca{1}t/;

    like( 'cat', $only_one_a, "'cat' matches /ca{1}t/" );

{n,} matches an atom at least n times:

    # equivalent to qr/ca+t/;
    my $some_a = qr/ca{1,}t/;

    like( 'cat',    $some_a, "'cat' matches /ca{1,}t/" );
    like( 'caat',   $some_a, "'caat' matches"          );
    like( 'caaat',  $some_a, "'caaat' matches"         );
    like( 'caaaat', $some_a, "'caaaat' matches"        );

{n,m} means that a match must occur at least n times and cannot occur more than m times:

    my $few_a = qr/ca{1,3}t/;

    like( 'cat',    $few_a, "'cat' matches /ca{1,3}t/" );
    like( 'caat',   $few_a, "'caat' matches"           );
    like( 'caaat',  $few_a, "'caaat' matches"          );

    unlike( 'caaaat', $few_a, "'caaaat' doesn't match" );

You may express the symbolic quantifiers in terms of the numeric quantifiers, but the symbolic quantifiers are shorter and get used more often.
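
The correspondence is ? to {0,1}, + to {1,}, and * to {0,}. This sketch checks that each pair agrees on a handful of sample words; the word list and variable names are arbitrary choices for the example:

```perl
use strict;
use warnings;
use feature 'say';

# each pair holds a symbolic quantifier and its numeric spelling
my @pairs = (
    [ qr/ca?t/, qr/ca{0,1}t/ ],
    [ qr/ca+t/, qr/ca{1,}t/  ],
    [ qr/ca*t/, qr/ca{0,}t/  ],
);

my @words      = qw( ct cat caat caaat );
my $mismatches = 0;

for my $pair (@pairs)
{
    my ($symbolic, $numeric) = @$pair;

    for my $word (@words)
    {
        my $by_symbol = $word =~ $symbolic ? 1 : 0;
        my $by_number = $word =~ $numeric  ? 1 : 0;
        $mismatches++ if $by_symbol != $by_number;
    }
}

say "mismatches: $mismatches";
```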

Greediness

The + and * quantifiers are greedy: they try to match as much of the input string as possible. This is particularly pernicious. Consider a naïve use of the "zero or more non-newline characters" pattern of .*:

    # a poor regex
    my $hot_meal = qr/hot.*meal/;

    say 'Found a hot meal!'
        if 'I have a hot meal' =~ $hot_meal;

    say 'Found a hot meal!'
         if 'one-shot, piecemeal work!' =~ $hot_meal;

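Greedy quantifiers start by matching as much of the string as they can. If the rest of the pattern then fails to match, the regex engine will back off one character at a time until it finds a match. That is why the second string matches: hot appears inside one-shot and meal appears inside piecemeal.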

The ? quantifier modifier turns a greedy quantifier non-greedy:

    my $minimal_greedy = qr/hot.*?meal/;

When given a non-greedy quantifier, the regular expression engine will prefer the shortest possible potential match. If that match fails, the engine will increase the number of characters identified by the .*? token combination one character at a time. Because * matches zero or more times, the minimal potential match for this token combination is zero characters:

    say 'Found a hot meal'
        if 'ilikeahotmeal' =~ /$minimal_greedy/;
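
To see the difference, it helps to look at what the quantified part actually consumed. This sketch wraps .* and .*? in capturing parentheses (see Capturing later in this chapter); the sample sentence is invented for the example:

```perl
use strict;
use warnings;
use feature 'say';

my $text = 'a hot meal and a piecemeal plan';

# greedy: .* runs to the end, then backs off to the last "meal"
my ($greedy)  = $text =~ /hot(.*)meal/;

# non-greedy: .*? starts empty and grows until the first "meal"
my ($minimal) = $text =~ /hot(.*?)meal/;

say "greedy:  '$greedy'";     # ' meal and a piece'
say "minimal: '$minimal'";    # ' '
```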

Use +? to match one or more items non-greedily:

    my $minimal_greedy_plus = qr/hot.+?meal/;

    unlike( 'ilikeahotmeal',   $minimal_greedy_plus );

    like( 'i like a hot meal', $minimal_greedy_plus );

The ? quantifier modifier applies to the ? (zero or one matches) quantifier as well as the range quantifiers. It always causes the regex to match as little of the input as possible.
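
A brief sketch of the modifier on a range quantifier and on ? itself. The captures show how much each version consumed; the variable names are illustrative:

```perl
use strict;
use warnings;
use feature 'say';

# range quantifier: greedy takes three a's, non-greedy takes one
my ($greedy_range)  = 'caaat' =~ /(ca{1,3})/;
my ($minimal_range) = 'caaat' =~ /(ca{1,3}?)/;

# zero-or-one quantifier: greedy takes the a, non-greedy skips it
my ($greedy_opt)    = 'cat' =~ /(ca?)/;
my ($minimal_opt)   = 'cat' =~ /(ca??)/;

say "$greedy_range $minimal_range";    # caaa ca
say "$greedy_opt $minimal_opt";        # ca c
```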

Regexes are powerful, but they're not always the best way to solve a problem. This is doubly true for the greedy patterns .+ and .*. A crossword puzzle fan who needs to fill in four boxes of 7 Down ("Rich soil") will find too many invalid candidates with the pattern:

    my $seven_down = qr/l$letters_only*m/;

If she runs this against all of the words in a dictionary, it'll match Alabama, Belgium, and Bethlehem long before it reaches the answer of loam. Not only are those words too long, but the matches start in the middle of the words.
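
The text never shows how $letters_only is defined, so this sketch assumes a case-insensitive ASCII letter class and a tiny stand-in word list instead of a real dictionary:

```perl
use strict;
use warnings;
use feature 'say';

# assumption: the original definition of $letters_only is not shown
my $letters_only = qr/[a-z]/i;
my $seven_down   = qr/l$letters_only*m/;

# hypothetical stand-in for a dictionary file
my @words   = qw( Alabama Belgium Bethlehem loam loom );
my @matches = grep { $_ =~ $seven_down } @words;

say "Possibility: $_" for @matches;
```

Every word in the list matches, because nothing forces the match to cover the whole word or to stop at four letters.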

Regex Anchors

It's important to know how the regex engine handles greedy matches, but it's equally important to know what kind of matches you do and don't want. Regex anchors force the regex engine to start or end a match at a fixed position. The start of string anchor (\A) dictates that any match must start at the very beginning of the string:

    # also matches "lammed", "lawmaker", and "layman"
    my $seven_down = qr/\Al${letters_only}{2}m/;

The end of string anchor (\z) requires that a match end at the very end of the string:

    # also matches "loom", but an obvious improvement
    my $seven_down = qr/\Al${letters_only}{2}m\z/;

You will often see the ^ and $ assertions used to match the start and end of strings. ^ does match the start of the string, but in certain circumstances it can match just after a newline within the string. Similarly, $ does match the end of the string (just before a newline, if it exists), but it can match just before a newline in the middle of the string. \A and \z are more specific and, thus, more useful.
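
A small sketch of the difference, using strings with embedded newlines. The /m modifier used here is described under Regex Modifiers; the sample strings are invented:

```perl
use strict;
use warnings;
use feature 'say';

my $word  = "loam\n";
my $lines = "sand\nloam\nclay";

# $ tolerates one trailing newline; \z does not
my $dollar_end = $word =~ /\Aloam$/  ? 1 : 0;
my $z_end      = $word =~ /\Aloam\z/ ? 1 : 0;

# with /m, ^ and $ also match around embedded newlines
my $caret_plain = $lines =~ /^loam$/    ? 1 : 0;
my $caret_multi = $lines =~ /^loam$/m   ? 1 : 0;

# \A and \z ignore /m and stay tied to the whole string
my $strict      = $lines =~ /\Aloam\z/m ? 1 : 0;

say "$dollar_end $z_end $caret_plain $caret_multi $strict";    # 1 0 0 1 0
```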

The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W). That boundary isn't a character in and of itself; it has no width. It's invisible. Use an anchored regex to find loam while prohibiting Belgium:

    my $seven_down = qr/\bl${letters_only}{2}m\b/;
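
To compare the three anchored patterns side by side, here is a sketch that again assumes a definition for $letters_only (the text does not show one) and uses a short hypothetical word list:

```perl
use strict;
use warnings;
use feature 'say';

# assumption: the original definition of $letters_only is not shown
my $letters_only = qr/[a-z]/i;

my $start_only = qr/\Al${letters_only}{2}m/;
my $both_ends  = qr/\Al${letters_only}{2}m\z/;
my $boundaries = qr/\bl${letters_only}{2}m\b/;

my @words = qw( Alabama lammed lawmaker layman loam loom );

my @from_start = grep { $_ =~ $start_only } @words;
my @whole_word = grep { $_ =~ $both_ends  } @words;

say "start only: @from_start";    # lammed lawmaker layman loam loom
say "both ends:  @whole_word";    # loam loom

# \b finds the word inside a longer string and still skips Belgium
my $clue  = 'Belgium exports rich loam';
my $found = $clue =~ $boundaries ? 1 : 0;

say "found in clue: $found";      # 1
```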

Metacharacters

Perl interprets several characters in regular expressions as metacharacters, characters which represent something other than their literal interpretation. You've seen a few metacharacters already (\b, ., and ?, for example). Metacharacters give regex wielders power far beyond mere substring matches. The regex engine treats all metacharacters as atoms.

The . metacharacter means "match any character except a newline". Many novices forget that nuance. A simple regex search—ignoring the obvious improvement of using anchors—for 7 Down might be /l..m/. Of course, there's always more than one way to get the right answer:

    for my $word (@words)
    {
        next unless length( $word ) == 4;
        next unless $word =~ /l..m/;
        say "Possibility: $word";
    }

If the potential matches in @words are more than the simplest English words, you will get false positives, because the . metacharacter also matches punctuation characters, whitespace, and digits. Be specific! The \w metacharacter represents all alphanumeric characters (Unicode and Strings) and the underscore:

        next unless $word =~ /l\w\wm/;

The \d metacharacter matches digits (also in the Unicode sense):

    # not a robust phone number matcher
    next unless $number =~ /\d{3}-\d{3}-\d{4}/;
    say "I have your number: $number";

Use the \s metacharacter to match whitespace. Whitespace means a literal space, a tab character, a carriage return, a form-feed, or a newline:

    my $two_three_letter_words = qr/\w{3}\s\w{3}/;

Negated Metacharacters

These metacharacters have negated forms. Use \W to match any character except a word character. Use \D to match a non-digit character. Use \S to match anything but whitespace. Use \B to match anywhere except a word boundary.
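
A few quick checks, as a sketch with invented sample strings, show each negated form at work:

```perl
use strict;
use warnings;
use feature 'say';

# \W: any character that is not a word character
my $clean_word  = 'cat' =~ /\W/ ? 1 : 0;    # 0: only word characters
my $odd_word    = 'c@t' =~ /\W/ ? 1 : 0;    # 1: @ is not a word character

# \D: any character that is not a digit
my $all_digits  = '2024' =~ /\D/ ? 1 : 0;   # 0
my $stray_x     = '20x4' =~ /\D/ ? 1 : 0;   # 1

# \S: any character that is not whitespace
my $two_words   = 'a b' =~ /\S\s\S/ ? 1 : 0;    # 1

# \B: a position that is not a word boundary
my $cat_inside  = 'catalog' =~ /cat\B/ ? 1 : 0; # 1: a letter follows cat
my $cat_alone   = 'cat nap' =~ /cat\B/ ? 1 : 0; # 0: a boundary follows cat

say "$clean_word $odd_word $all_digits $stray_x $two_words $cat_inside $cat_alone";
```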

Character Classes

When none of those metacharacters is specific enough, you can make your own group of characters into a character class by enclosing them in square brackets. A character class allows you to treat a group of alternatives as a single atom.

    my $ascii_vowels = qr/[aeiou]/;
    my $maybe_cat    = qr/c${ascii_vowels}t/;

Interpolation Happens

Without those curly braces, Perl's parser would interpret the variable name as $ascii_vowelst, which either causes a compile-time error about an unknown variable or interpolates the contents of an existing $ascii_vowelst into the regex.

The hyphen character (-) allows you to include a contiguous range of characters in a class, such as this $ascii_letters_only regex:

    my $ascii_letters_only = qr/[a-zA-Z]/;

To include the hyphen as a member of the class, use it at the start or end of the class:

    my $interesting_punctuation = qr/[-!?]/;

... or escape it:

    my $line_characters = qr/[|=\-_]/;

Use the caret (^) as the first element of the character class to mean "anything except these characters":

    my $not_an_ascii_vowel = qr/[^aeiou]/;

Use a caret anywhere but the first position to make it a member of the character class. To include a hyphen in a negated character class, place it after the caret or at the end of the class, or escape it.
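
As a sketch of those placement rules (the class contents and test strings here are arbitrary examples):

```perl
use strict;
use warnings;
use feature 'say';

# caret after the first position: a literal member of the class
my $x_or_caret = qr/[x^]/;

# hyphen right after the negating caret: a literal hyphen,
# so this class means "anything except a hyphen or a lowercase letter"
my $not_hyphen_or_lower = qr/[^-a-z]/;

my $has_caret = 'a^b' =~ $x_or_caret ? 1 : 0;
my $no_caret  = 'ab'  =~ $x_or_caret ? 1 : 0;

my $hyphenated = 'well-known' =~ $not_hyphen_or_lower ? 1 : 0;
my $spaced     = 'well known' =~ $not_hyphen_or_lower ? 1 : 0;

say "$has_caret $no_caret $hyphenated $spaced";    # 1 0 0 1
```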

Capturing

Regular expressions allow you to group and capture portions of the match for later use. To extract an American telephone number of the form (202) 456-1111 from a string:

    my $area_code    = qr/\(\d{3}\)/;
    my $local_number = qr/\d{3}-?\d{4}/;
    my $phone_number = qr/$area_code\s?$local_number/;

Note especially the escaping of the parentheses within $area_code. Parentheses are special in Perl regular expressions. They group atoms into larger units and also capture portions of matching strings. To match literal parentheses, escape them with backslashes as seen in $area_code.

Named Captures

Perl 5.10 added named captures, which let you label the portions of a match you capture and retrieve them later by name. For example, when extracting a phone number from contact information:

    if ($contact_info =~ /(?<phone>$phone_number)/)
    {
        say "Found a number $+{phone}";
    }

Regexes tend to look like punctuation soup until you can group various portions together as chunks. Named capture syntax has the form:

    (?<capture name> ... )

Parentheses enclose the capture. The ?< name > construct immediately follows the opening parenthesis and provides a name for this particular capture. The remainder of the capture is a regular expression.

When a match against the enclosing pattern succeeds, Perl updates the magic variable %+. In this hash, the key is the name of the capture and the value is the portion of the string which matched the capture.
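
Putting the pieces together, a self-contained sketch might look like this. The contents of $contact_info are invented for the example:

```perl
use strict;
use warnings;
use feature 'say';

my $area_code    = qr/\(\d{3}\)/;
my $local_number = qr/\d{3}-?\d{4}/;
my $phone_number = qr/$area_code\s?$local_number/;

# hypothetical input
my $contact_info = 'Call the front desk at (202) 456-1111 after noon.';

my $found;

if ($contact_info =~ /(?<phone>$phone_number)/)
{
    # %+ maps each capture name to the text that capture matched
    $found = $+{phone};
    say "Found a number $found";
}
```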

Numbered Captures

Perl has supported numbered captures for ages:

    if ($contact_info =~ /($phone_number)/)
    {
        say "Found a number $1";
    }

This form of capture provides no identifying name and does nothing to %+. Instead, Perl stores the captured substring in a series of magic variables. The first matching capture that Perl finds goes into $1, the second into $2, and so on. Capture counts start at the opening parenthesis of the capture. Thus the first left parenthesis begins the capture into $1, the second into $2, and so on.
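
Nested parentheses make the counting rule concrete. In this sketch (the date string is just sample data), the outer group opens first and so becomes $1:

```perl
use strict;
use warnings;
use feature 'say';

my $stamp = '2024-03-15';
my @parts;

#              1 2         3          4
if ($stamp =~ /((\d{4})-(\d{2}))-(\d{2})/)
{
    @parts = ($1, $2, $3, $4);
    say "year and month: $1";    # 2024-03
    say "year:           $2";    # 2024
    say "month:          $3";    # 03
    say "day:            $4";    # 15
}
```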

While the syntax for named captures is longer than for numbered captures, it provides additional clarity. Counting left parentheses is tedious work, and combining regexes which each contain numbered captures is difficult. Named captures improve regex maintainability—though name collisions are possible, they're relatively infrequent. Minimize the risk by using named captures only in top-level regexes.

In list context, a regex match returns a list of captured substrings:

    if (my ($number) = $contact_info =~ /($phone_number)/)
    {
        say "Found a number $number";
    }

Numbered captures are also useful in simple substitutions, where named captures may be more verbose:

    my $order = 'Vegan brownies!';

    $order =~ s/Vegan (\w+)/Vegetarian $1/;
    # or
    $order =~ s/Vegan (?<food>\w+)/Vegetarian $+{food}/;

Grouping and Alternation

Previous examples have all applied quantifiers to simple atoms. You may apply them to any regex element:

    my $pork  = qr/pork/;
    my $beans = qr/beans/;

    like( 'pork and beans', qr/\A$pork?.*?$beans/,
         'maybe pork, definitely beans' );

If you expand the regex manually, the results may surprise you:

    my $pork_and_beans = qr/\Apork?.*beans/;

    like( 'pork and beans', qr/$pork_and_beans/,
        'maybe pork, definitely beans' );
    like( 'por and beans', qr/$pork_and_beans/,
         'wait... no phylloquinone here!' );

Sometimes specificity helps pattern accuracy:

    my $pork  = qr/pork/;
    my $and   = qr/and/;
    my $beans = qr/beans/;

    like( 'pork and beans', qr/\A$pork? $and? $beans/,
        'maybe pork, maybe and, definitely beans' );

Some regexes need to match either one thing or another. The alternation metacharacter (|) indicates that either possibility may match.

    my $rice  = qr/rice/;
    my $beans = qr/beans/;

    like( 'rice',  qr/$rice|$beans/, 'Found rice'  );
    like( 'beans', qr/$rice|$beans/, 'Found beans' );

While it's easy to interpret rice|beans as meaning ric, followed by either e or b, followed by eans, alternations always include the entire fragment to the nearest regex delimiter, whether the start or end of the pattern, an enclosing parenthesis, another alternation character, or a square bracket.

Alternation has a lower precedence (Precedence) than even atoms:

    like(   'rice',  qr/rice|beans/, 'Found rice'   );
    like(   'beans', qr/rice|beans/, 'Found beans'  );
    unlike( 'ricb',  qr/rice|beans/, 'Found hybrid' );

To reduce confusion, use named fragments in variables ($rice|$beans) or group alternation candidates in non-capturing groups:

    my $starches = qr/(?:pasta|potatoes|rice)/;

The (?:) sequence groups a series of atoms without making a capture.
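
One way to see the effect is to compare the list of captures returned by a grouped pattern and a capturing one. This is a sketch with made-up variable names:

```perl
use strict;
use warnings;
use feature 'say';

my $order = 'rice and beans';

# non-capturing group: only (\w+) produces a capture
my $grouped  = qr/\A(?:pasta|potatoes|rice) and (\w+)/;

# capturing group: the starch becomes $1 and pushes (\w+) to $2
my $captured = qr/\A(pasta|potatoes|rice) and (\w+)/;

my @from_grouped  = $order =~ $grouped;
my @from_captured = $order =~ $captured;

say "grouped:  @from_grouped";     # beans
say "captured: @from_captured";    # rice beans
```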

Non-Captured For Your Protection

A stringified regular expression includes an enclosing non-capturing group; qr/rice|beans/ stringifies as (?^u:rice|beans).

Other Escape Sequences

To match a literal instance of a metacharacter, escape it with a backslash (\). You've seen this before, where \( refers to a single left parenthesis and \] refers to a single right square bracket. \. refers to a literal period character instead of the "match anything but an explicit newline character" atom.

You will likely need to escape the alternation metacharacter (|) as well as the end of line metacharacter ($) and the quantifiers (+, ?, *).

The metacharacter disabling characters (\Q and \E) disable metacharacter interpretation within their boundaries. This is especially useful when taking match text from a source you don't control:

    my ($text, $literal_text) = @_;

    return $text =~ /\Q$literal_text\E/;

The $literal_text argument can contain anything—the string ** ALERT **, for example. Within the fragment bounded by \Q and \E, Perl will interpret the regex as \*\* ALERT \*\* and attempt to match literal asterisk characters instead of treating the asterisks as greedy quantifiers.
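
As a sketch, here is that fragment wrapped in a complete function. The name contains_literal is hypothetical:

```perl
use strict;
use warnings;
use feature 'say';

# hypothetical helper: true if $text contains $literal_text verbatim
sub contains_literal
{
    my ($text, $literal_text) = @_;

    return $text =~ /\Q$literal_text\E/ ? 1 : 0;
}

my $alert   = contains_literal( '** ALERT ** disk full', '** ALERT **' );

# with \Q, the period must match a literal period
my $literal = contains_literal( 'abc', 'a.c' );

# without \Q, the period is a metacharacter and matches the b
my $pattern = 'a.c';
my $as_meta = 'abc' =~ /$pattern/ ? 1 : 0;

say "$alert $literal $as_meta";    # 1 0 1
```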

Regex Security

Be cautious when processing regular expressions from untrusted user input. A malicious regex master can craft a regular expression which may take years to match input strings, creating a denial-of-service attack against your program.

Assertions

Regex anchors such as \A, \b, \B, and \Z are a form of regex assertion, which requires that the string meet some condition. These assertions do not match individual characters within the string. No matter what the string contains, the regex qr/\A/ will always match.

Zero-width assertions match a pattern. Most importantly, they do not consume the portion of the pattern that they match. For example, to find a cat on its own, you might use a word boundary assertion:

    my $just_a_cat = qr/cat\b/;

... but if you want to find a non-disastrous feline, you might use a zero-width negative look-ahead assertion:

    my $safe_feline = qr/cat(?!astrophe)/;

The construct (?!...) matches the phrase cat only if the phrase astrophe does not immediately follow. The zero-width positive look-ahead assertion:

    my $disastrous_feline = qr/cat(?=astrophe)/;

... matches the phrase cat only if the phrase astrophe immediately follows. While a normal regular expression can accomplish the same thing, consider a regex to find all non-catastrophic words in the dictionary which start with cat:

    my $disastrous_feline = qr/cat(?!astrophe)/;

    while (<$words>)
    {
        chomp;
        next unless /\A(?<cat>$disastrous_feline.*)\Z/;
        say "Found a non-catastrophe '$+{cat}'";
    }

The zero-width assertion consumes none of the source string, leaving the anchored fragment .*\Z to match. Otherwise, the capture would only capture the cat portion of the source string.

To assert that your feline never occurs at the start of a line, you might use a zero-width negative look-behind assertion. These assertions must have fixed sizes. You may not use quantifiers:

    my $middle_cat = qr/(?<!\A)cat/;

The construct (?<!...) contains the fixed-width pattern. You could also express that the cat must always occur immediately after a space character with a zero-width positive look-behind assertion:

    my $space_cat = qr/(?<=\s)cat/;

The construct (?<=...) contains the fixed-width pattern. This approach can be useful when combining a global regex match with the \G anchor.
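
A short sketch exercising both look-behind patterns against a few invented strings:

```perl
use strict;
use warnings;
use feature 'say';

my $middle_cat = qr/(?<!\A)cat/;
my $space_cat  = qr/(?<=\s)cat/;

# negative look-behind: reject a cat at the very start of the string
my $leading_cat = 'cat nap' =~ $middle_cat ? 1 : 0;
my $tomcat      = 'tomcat'  =~ $middle_cat ? 1 : 0;

# positive look-behind: require whitespace just before the cat
my $after_space = 'my cat'  =~ $space_cat ? 1 : 0;
my $no_space    = 'tomcat'  =~ $space_cat ? 1 : 0;

say "$leading_cat $tomcat $after_space $no_space";    # 0 1 1 0
```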

A newer feature of Perl regexes is the keep assertion \K. This zero-width positive look-behind assertion can have a variable length:

    my $spacey_cat = qr/\s+\Kcat/;

    like( 'my cat has been to space', $spacey_cat );
    like( 'my  cat  has  been  to  doublespace',
         $spacey_cat );

\K is surprisingly useful for certain substitutions which remove the end of a pattern. It lets you match a pattern but remove only a portion of it:

    my $exclamation = 'This is a catastrophe!';
    $exclamation    =~ s/cat\K\w+!/./;

    like( $exclamation, qr/\bcat\./,
                          "That wasn't so bad!" );

Everything up until the \K assertion matches, but only the portion of the match after the assertion gets substituted away.

Regex Modifiers

Several modifiers change the behavior of the regular expression operators. These modifiers appear at the end of the match, substitution, and qr// operators. For example, to enable case-insensitive matching:

    my $pet = 'CaMeLiA';

    like( $pet, qr/Camelia/,  'Nice butterfly!'  );
    like( $pet, qr/Camelia/i, 'shift key br0ken' );

The first like() will fail, because the strings contain different letters. The second like() will pass, because the /i modifier causes the regex to ignore case distinctions. M and m are equivalent in the second regex due to the modifier.

You may also embed regex modifiers within a pattern:

    my $find_a_cat = qr/(?<feline>(?i)cat)/;

The (?i) syntax enables case-insensitive matching only for its enclosing group. In this case, that's the named capture. You may use multiple modifiers with this form. Disable specific modifiers by preceding them with the minus character (-):

    my $find_a_rational = qr/(?<number>(?-i)Rat)/;

The multiline modifier, /m, allows the ^ and $ anchors to match at any newline embedded within the string.

The /s modifier treats the source string as a single line such that the . metacharacter matches the newline character. Damian Conway suggests the mnemonic that /m modifies the behavior of multiple regex metacharacters, while /s modifies the behavior of a single regex metacharacter.
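
A two-line string shows both modifiers at work. This is only a sketch; the text is sample data:

```perl
use strict;
use warnings;
use feature 'say';

my $text = "first line\nsecond line";

# /m: ^ may match just after the embedded newline
my $caret_plain = $text =~ /^second/  ? 1 : 0;
my $caret_multi = $text =~ /^second/m ? 1 : 0;

# /s: . may match the newline itself
my $dot_plain   = $text =~ /first.*second/  ? 1 : 0;
my $dot_single  = $text =~ /first.*second/s ? 1 : 0;

say "$caret_plain $caret_multi $dot_plain $dot_single";    # 0 1 0 1
```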

The /r modifier causes a substitution operation to return the result of the substitution, leaving the original string unchanged. If the substitution succeeds, the result is a modified copy of the original. If the substitution fails (because the pattern does not match), the result is an unmodified copy of the original:

    my $status     = 'I am hungry for pie.';
    my $newstatus  = $status =~ s/pie/cake/r;
    my $statuscopy = $status
                   =~ s/liver and onions/bratwurst/r;

    is( $status, 'I am hungry for pie.',
        'original string should be unmodified' );

    like( $newstatus,    qr/cake/,      'cake wanted' );
    unlike( $statuscopy, qr/bratwurst/, 'wurst not'   );

The /x modifier allows you to embed additional whitespace and comments within patterns. With this modifier in effect, the regex engine ignores whitespace and comments, so your code can be more readable:

    my $attr_re = qr{
        \A                    # start of string

        (?:
          [;\n\s]*            # spaces and semicolons
          (?:/\*.*?\*/)?      # C comments
        )*

        ATTR

        \s+
        (   U?INTVAL
          | FLOATVAL
          | STRING\s+\*
        )
    }x;

This regex isn't simple, but comments and whitespace improve its readability. Even if you compose regexes together from compiled fragments, the /x modifier can still improve your code.

The /g modifier matches a regex globally throughout a string. This makes sense when used with a substitution:

    # appease the Mitchell estate
    my $contents = slurp( $file );
    $contents    =~ s/Scarlett O'Hara/Mauve Midway/g;

When used with a match—not a substitution—the \G metacharacter allows you to process a string within a loop one chunk at a time. \G matches at the position where the most recent match ended. To process a poorly-encoded file full of American telephone numbers in logical chunks, you might write:

    while ($contents =~ /\G(\w{3})(\w{3})(\w{4})/g)
    {
        push @numbers, "($1) $2-$3";
    }

Be aware that the \G anchor will begin at the last point in the string where the previous iteration of the match occurred. If the previous match ended with a greedy match such as .*, the next match will have less available string to match. Lookahead assertions can also help.
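
To see what \G adds, compare the same loop with and without it. This sketch uses \d instead of \w and an invented string with stray characters in the middle:

```perl
use strict;
use warnings;
use feature 'say';

# two numbers, then junk, then a third number (sample data)
my $contents = '20245611112025550123--3035550100';

my (@anchored, @unanchored);

# with \G, each match must begin exactly where the previous one ended,
# so the loop stops at the junk
while ($contents =~ /\G(\d{3})(\d{3})(\d{4})/g)
{
    push @anchored, "($1) $2-$3";
}

# without \G, the engine skips ahead over the junk to find a match
while ($contents =~ /(\d{3})(\d{3})(\d{4})/g)
{
    push @unanchored, "($1) $2-$3";
}

say scalar @anchored;      # 2
say scalar @unanchored;    # 3
```

Stopping at the junk is often what you want when parsing: it tells you where the well-formed input ended instead of silently skipping bad data.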

The /e modifier allows you to write arbitrary code on the right side of a substitution operation. If the match succeeds, the regex engine will run the code, using its return value as the substitution value. The earlier global substitution example could be simpler with code like:

    # appease the Mitchell estate
    $sequel  =~ s{Scarlett( O'Hara)?}
                 {
                    'Mauve' . ( defined $1
                            ? ' Midway'
                            : '' )
                 }ge;

Each additional occurrence of the /e modifier will cause another evaluation of the result of the expression, though only Perl golfers use anything beyond /ee.
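
A simpler sketch of /e, using an invented price list, doubles every number in a string:

```perl
use strict;
use warnings;
use feature 'say';

my $prices  = 'tea 3, cake 5';
my $doubled = $prices;

# the replacement side is Perl code; its value replaces each match
$doubled =~ s/(\d+)/$1 * 2/ge;

say $doubled;    # tea 6, cake 10
```

Without /e, the replacement would be the literal text "3 * 2" and "5 * 2" with the captures interpolated but no arithmetic performed.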

Smart Matching

The smart match operator, ~~, compares two operands and returns a true value if they match. The type of comparison depends on the type of both operands. given (Switch Statements) performs an implicit smart match.

As of Perl 5.18, this feature is experimental. The details of the current design are complex and unwieldy, and no proposal for simplifying things has gained enough popular support to warrant a complete overhaul. The more complex your operands, the more likely you are to receive confusing results. Avoid comparing objects and stick to simple operations between two scalars or one scalar and one aggregate for the best results.

The smart match operator is an infix operator:

    say 'They match (somehow)' if $loperand ~~ $roperand;

The type of comparison generally depends first on the type of the right operand and then on the left operand. For example, if the right operand is a scalar with a numeric component, the comparison will use numeric equality. If the right operand is a regex, the comparison will use a grep or a pattern match. If the right operand is an array, the comparison will perform a grep or a recursive smart match. If the right operand is a hash, the comparison will check the existence of one or more keys. A large and intimidating chart in perldoc perlsyn gives far more details about all the comparisons smart match can perform.

These examples are deliberately simple, because smart match can be confusing:

    my ($x, $y) = (10, 20);
    say 'Not equal numerically' unless $x ~~ $y;

    my $z = '10 little endians';
    say 'Equal numeric-ishally' if $x ~~ $z;

    my $needle = qr/needle/;

    say 'Pattern match' if 'needle' ~~ $needle;

    say 'Grep through array' if @haystack ~~ $needle;

    say 'Grep through hash keys' if %hayhash ~~ $needle;

    say 'Grep through array' if $needle ~~ @haystack;

    say 'Array elements exist as hash keys'
        if %hayhash    ~~ @haystack;

    say 'Smart match elements' if @straw ~~ @haystack;

    say 'Grep through hash keys' if $needle ~~ %hayhash;

    say 'Array elements exist as hash keys'
        if @haystack  ~~ %hayhash;

    say 'Hash keys identical' if %hayhash ~~ %haymap;

Smart match works even if one operand is a reference to the given data type:

    say 'Hash keys identical' if %hayhash ~~ \%hayhash;

It's difficult to recommend the use of smart match except in the simplest circumstances, but it can be useful when you have a literal string or number to match against a variable.
