© 1997 The McGraw-Hill Companies, Inc. All rights reserved.
Any use of this Beta Book is subject to the rules stated in the Terms of Use.

Chapter 9: Regular Expressions

If one were to come up with a list of things that were uniquely Perl, the things that would come to mind would be the way that Perl handles its:

1) data structures

and

2) regular expressions

Even if you are not familiar with Perl, you might already be familiar with regular expressions. If you work with complicated editors (vi, emacs), have programmed with a C regular expression package, have programmed in Icon, Python, Tcl/Tk, other scripting languages, or Visual C++, or have even taken finite automata in college (you know who you are), then you are probably familiar with them.

Unequivocally, Perl has more powerful regular expressions than any of these languages or packages. This chapter covers this power in detail.

Chapter Overview

However, with this power comes a complicated syntax. Perl's regular expression set is so powerful because it is so well integrated into the language. This is a result of it evolving down the years, gradually solving problems that were thrown at it, rather than starting from a preconceived notion of what should be in a regular expression package. In fact, you could probably say that about the language itself!

There will therefore be several parts to this chapter, seven rules that we shall talk about in conjunction with regular expressions, just to ease you into learning them.

The first thing we shall talk about is the three basic regular expression operators that Perl has: m (for matching), s (for substitution), and tr (for translating). Of these three, 95% of your code will be either 'matching' or 'substitution' regular expressions. It is rare that you will use the 'tr' operator, but it can be invaluable in those few cases in which you do.

Then we shall start in on the basic regular expression principles: that a Perl expression only matches on a scalar, that backtracking is used to deal with partial regular expression matches, how regular expressions translate Perl variables inside of them, and the use of 'metacharacters' to match if a character is in a certain set of characters.

We continue in this vein, talking about Perl regular expressions' support for multiple 'groupings' of characters (called multiple match operators) and about the forms of matching available to Perl, greedy and lazy. Then we get back to what are termed 'back references'.

Third, we talk about the various modifiers (i, s, g, x, and o) that we can add to regular expressions, which really let the user customize the functionality inside his/her module to a great degree, if he or she wants to.

And last, as always, we get to examples. Regular expressions are known for being hard to learn, so we give approximately 15 pages of examples of them in action, going from the very simple ('getting words in a text file') to actually building a complicated regular expression from scratch (when we build a double quoted string).

Introduction

If you don't believe me that Perl's regular expressions are expressive and/or complicated, just look at the following list of things that Perl's regular expressions can do. They can:

  • Match any type of ASCII or binary data

  • Deal with any length of data

  • Have expressions that are up to 65536 (64K) bytes long

  • Iterate over patterns of data, 'remembering' where the pattern matched last

  • Deal with alternate patterns of data (i.e.: matching either an 'a' or a 'b'.)

  • Match 'classes' of characters (uppercase, lowercase, numerics, binary chars, etc.)

  • Deal with 'negative lookahead', in which a pattern matches only if it is not followed by another pattern

  • Deal with both 'greedy' matches (ones in which a character set matches the greatest number of characters that it can match) and 'non-greedy' matches (ones in which a character set matches the least number of characters it can match.)

  • Easily be used to create a lexer, and along with byacc, make a parser.

    In addition, Perl's regular expressions have an extra readability form, which facilitates both your sanity and clearer thinking when dealing with regular expressions. This is a new feature in Perl. If you get nothing else out of this chapter, be sure to understand how to make 'readable expressions'.

    Anyway, as Larry Wall would say, that's definitely enough hype. The flipside of regular expressions is that they can notoriously misbehave if you don't know how to use them. So it's very important to understand the usage of regular expressions and, by extension, the general principles that are involved in their construction.

    Perl's regular expression set is so unlike anything else available that even if you have training in other regular expression packages, you will find features in Perl's that simply don't exist in others. Also, Perl sets itself aside by integrating regular expressions intimately, and cleanly, with the rest of the language.

    In fact, this chapter doesn't do Regular Expressions in Perl justice. There is simply too much information to fit into one chapter. If you want more information, I suggest referring to the perlre manual page.

    There is also a very well done book 'Mastering Regular Expressions' by Jeffrey Friedl (listed in the bibliography of this book). It has many well designed, thoughtful examples, and goes into much more depth about the material listed below (a couple of examples in this chapter were included by permission of the author, commented below.)

    Basics of Perl Regular Expressions

    A constant source of questions and comments on the newsgroup 'comp.lang.perl.misc' is supposed bugs in the string matching operators of Perl, that is, in its regular expression engine.

    Blaming the engine is a common mistake that both beginners and advanced programmers make. The Perl regular expression engine is not buggy (well, not that buggy). It is, in the best tradition of computer science, frustratingly logical.

    That logic can make it intimidating to the beginner. The Perl regular expression engine obeys your commands to the letter, but often those commands don't do exactly what you want them to do. It's an example of people telling the computer to "do what I think, not what I said."

    The best, surest way to learn and use regular expressions is to take simple examples and then progress to more complicated examples, while first taking to heart the principles underlying their construction. If you are new to them, a thorough understanding of regular expressions will make you instantly more productive in tasks ranging as far and wide as:

    1) code analysis

    2) correcting misspellings

    3) sorting through tons of data

    4) code generators

    For example, genetic engineers are extremely happy using Perl for looking for patterns in gene sequences. And, on a more personal note, some of the formatting of this book was done with regular expressions.

    For those of you who are unfamiliar with regular expressions, here's a concise definition:

  • Regular Expression: a pattern which uses a logical notation to represent a set of possible strings.

    That's all there really is to it (and that's what 10 weeks and $500 told me in college). A group of strings, such as 'cat', 'cats', 'catty', 'tomcat', 'alleycat', etc., can be matched by one regular expression:

    $catstring =~ m/cat/; # matches any of the above strings. Returns a '1'.

    This is, in fact, the way Perl looks for 'cat' in every one of the above strings (i.e.: tomcat, alleycat).

    Like many of the items that make Perl so strong, regular expressions can be distilled down to a few principles. Learn the principles, and you can do wondrous things. Although we call them 'basic' regular expressions, they start at the very basic, and move into some fairly advanced material.

    Principle 1: Three forms of regular expressions: matching, substituting, and translating

    There are three regular expression operators listed below, in Table 9.1:

    Operator    Meaning (in English)

    m//         'match'

    s///        'substitute'

    tr///       'translate'

    Each one of these expressions is explained in more detail below:

  • matching: the form 'm/<regexp>/' indicates that the regular expression inside the '//' is going to be matched against the scalar on the left hand side of the =~ or !~. As syntactic sugar, you can say /<regexp>/, leaving out the 'm'.

  • substituting: the form 's/<regexp>/<substituteText>/' indicates that the text matched by the regular expression <regexp> is going to be replaced by the text <substituteText>. Unlike with 'm', you cannot leave out the 's'.

  • translating: the form 'tr/<charClass>/<substituteClass>/' takes a range of characters <charClass> and replaces each one with the corresponding character in <substituteClass>.

  • Note that translating isn't really a regular expression, but it is often used to manipulate data in ways that are difficult to do with regular expressions. Note also that tr takes plain character ranges, without square brackets. Hence 'tr/0-9/9876543210/' turns the string '123456789' into '876543210', etc.
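    As a minimal sketch of the translate operator (the variable names here are made up for illustration), the following maps each digit to its 'mirror' and then upper-cases a string, using the count that tr returns:

```perl
use strict;
use warnings;

# Map each digit to its "mirror": 0->9, 1->8, ..., 9->0.
my $digits = '123456789';
(my $mirrored = $digits) =~ tr/0-9/9876543210/;
print "$mirrored\n";

# tr/// returns the number of characters it translated.
my $shout = 'Hello World';
my $count = ($shout =~ tr/a-z/A-Z/);
print "$shout ($count letters changed)\n";
```

    The parenthesized 'my' form copies the string first, so the original $digits is left untouched.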

    You bind these expressions to a scalar by using =~ (in English, 'does' as in "does match") and !~ (in English, 'doesn't' as in "doesn't match"). As an example of this, we give six sample regular expressions below, along with their corresponding definitions:

    $scalarName =~ s/a/b/; # replace the first a with b, and return true if a replacement was made

    $scalarName =~ m/a/; # does the scalar $scalarName have an a in it?

    $scalarName =~ tr/A-Z/a-z/; # translate all capital letters to lower case ones, and return the number of characters translated (true if any were)

    $scalarName !~ s/a/b/; # replace the first a with b, and return false if a replacement was indeed made.

    $scalarName !~ m/a/; # does the scalar $scalarName match the character a? Return false if it does.

    $scalarName !~ tr/0-9/a-j/; # translate the digits to the letters a thru j, and return false if any were translated.
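    A short sketch of what these bindings return, assuming a sample string of our own choosing ('banana'):

```perl
use strict;
use warnings;

my $text = 'banana';

my $has_a   = ($text =~ m/a/) ? 1 : 0;   # 1: there is an 'a'
my $lacks_z = ($text !~ m/z/) ? 1 : 0;   # 1: there is no 'z'

# Without the g modifier, s/// changes only the first 'a'.
my $changed = ($text =~ s/a/b/);

print "has_a=$has_a lacks_z=$lacks_z changed=$changed text=$text\n";
```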

    If we say something like 'horned toad' =~ m/toad/, this turns into Figure 9.1:

    Figure 9.1: Simple pattern match

    In addition, if you are matching against the special variable $_ (as you might do in while loops, map, or grep), you can do without the !~ and =~. Hence, all of the following work:

    my @elements = ('a1','a2','a3','a4','a5');

    foreach (@elements) { s/a/b/; }

    This makes @elements equal to ('b1','b2','b3','b4','b5').

    while (<$FD>) { print if (m/ERROR/); }

    prints out all the lines with the string ERROR in them. And:

    if (grep(/pattern/, @lines)) { print "the variable \@lines has pattern in it!\n"; }

    prints something only if lines have the pattern 'pattern' in them. This bears directly on the next principle.
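    The $_ shortcuts above can be tried out with a small self-contained sketch. The sample data is invented, and an in-memory array stands in for the file handle:

```perl
use strict;
use warnings;

my @elements = ('a1', 'a2', 'a3');

# foreach aliases $_ to each element, so s/// edits the array in place.
foreach (@elements) { s/a/b/; }
print "@elements\n";

# A bare m// inside the loop is also tested against $_.
my @lines = ("all fine\n", "ERROR: disk full\n", "ERROR: no route\n");
my @hits;
foreach (@lines) { push @hits, $_ if m/ERROR/; }
print @hits;
```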

    Principle 2: Regular Expressions match only on scalars.

    Note the importance of scalars here. If you try something such as:

    @arrayName = ('variable1', 'variable2');

    @arrayName =~ m/variable/; # looks for 'variable' in the array? No! use grep instead

    Then @arrayName matching is not going to work! It gets interpreted as '2' by Perl, and this means you are saying:

    '2' =~ m/variable/;

    This is not going to give expected results, to say the least. If you want to do this, say:

    grep(m/variable/, @arrayName);

    which loops through each of the elements in @arrayName, returning (in scalar context) the number of times it matched, and in array context, the actual list of elements that matched.
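    A brief sketch of the two contexts, with a third, non-matching element added to the array purely for illustration:

```perl
use strict;
use warnings;

my @arrayName = ('variable1', 'variable2', 'other');

my @matched = grep { m/variable/ } @arrayName;   # list context: the elements
my $count   = grep { m/variable/ } @arrayName;   # scalar context: how many

print "matched: @matched\n";
print "count: $count\n";
```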

    Principle 3: A regular expression matches the earliest possible match of a given pattern. By default, it only matches or replaces a given regular expression ONCE.

    This principle uses a process called backtracking to figure out how to match a given string. If it finds a partial match, then finds something that invalidates that match, it backtracks the least possible amount in the string that it can without missing any matches.

    This is probably the most helpful principle to understand what the regular expression is doing, and you don't need Perl-ish forms to understand what it is doing. Suppose you had the following pattern:

    'silly people do silly things if in silly moods'

    And you wanted to match the pattern

    'silly moods'

    What happens then is the regular expression engine matches 'silly ', then hits the 'p' in people. At that point, the regular expression engine understands that the first 'silly' won't match, so it moves up to the 'p' and keeps trying to match. It then hits the second 'silly', and then tries to match 'moods'. It gets a 't' (in 'things') instead, hence moves up to the 't' in 'things', and keeps trying to match. The engine then hits the third 'silly', and tries to match 'moods'. It does so, and the engine finally matches. Pictorially, this becomes something like Figure 9.2:

    Figure 9.2: Simple backtracking

    Backtracking will become very important when we get to wildcards. If there are several wildcards in the same regular expression, all intertwined, then there are pathological cases where backtracking becomes very expensive. Consider an expression such as:

    $line =~ m/expression.*matching.*could.*be.*very.*expensive.*/;

    The '.*' indicates a 'wild card', which means 'match any character (besides newline) zero or more times'. An expression like this can take a LONG time if there are possible matches at the end of the string which don't work, since the engine will backtrack like crazy. See the principle on wildcards for more information on this.

    If you see something like this, you can usually split up your regular expression into parts. In other words, simplify your regular expression.
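    To see the 'earliest match, once only' behavior for yourself, here is a small sketch. It uses Perl's built-in @- array, whose first element holds the offset at which the last successful match started; the replacement word 'odd' is arbitrary:

```perl
use strict;
use warnings;

my $text = 'silly people do silly things if in silly moods';

# The first two 'silly's are tried and abandoned; the match starts at the third.
my $offset;
if ($text =~ m/silly moods/) {
    $offset = $-[0];
    print "matched at offset $offset\n";
}

# Without the g modifier, only the earliest match is replaced.
(my $once = $text) =~ s/silly/odd/;
print "$once\n";
```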

    Principle 4: Regular expressions can take ANY and ALL characters that double quoted strings can.

    In the first compartment of the s/// operator (s/*//), or in the m// operator (m/*/), the items inside are treated exactly like double quoted strings (with some extra added functionality, namely special regular expression characters. See below.) You can interpolate with them:

    $variable = 'TEST';

    $a =~ m/${variable}aha/;

    and

    $a = "${variable}aha";

    both refer to the same string: the first matches the string 'TESTaha' in $a, the second sets $a to the string 'TESTaha'.

    Since it is true that regular expressions can take every single character that a double quoted string can take, you can do things such as:

    $expression = 'hello';

    @arrayName = ('elem1','elem2');

     

    $variable =~ m/$expression/; # this equals m/hello/;

    Here, we simply expand $expression into 'hello' to get m/hello/. This trick works for arrays as well:

    $variable =~ m/@arrayName/; # this equals m/elem1 elem2/;

    Here, this is equal to m/elem1 elem2/. If the special variable $" was set to '|', this would be equal to m/elem1|elem2/, which as we shall see, matches either 'elem1' or 'elem2' in a string. This works for special characters too:

    $variable =~ m/\x01\27/; # match binary character x01, and

    # octal character 27.

     

    $variable =~ s/\t\t\t/   /; # replace three tabs with three spaces.
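    The array interpolation trick can be sketched as follows. The test string is invented, and 'local' is used so that the change to $" does not leak out of the block:

```perl
use strict;
use warnings;

my @arrayName = ('elem1', 'elem2');
my $variable  = 'this line mentions elem2 only';

# With the default separator (a space) this is m/elem1 elem2/, which fails here.
my $plain = ($variable =~ m/@arrayName/) ? 1 : 0;

my $matched;
{
    # $" is the separator used when an array is interpolated.
    local $" = '|';
    $matched = ($variable =~ m/@arrayName/) ? 1 : 0;   # same as m/elem1|elem2/
}
print "plain: $plain, with alternation: $matched\n";
```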

    In fact, with a few exceptions, which we shall talk about now, Perl handles string processing in m// exactly as if it were in double quotes. The exceptions are characters that have special significance inside a regular expression. What happens, then, if you want to match something such as a forward slash ('/') or parentheses ('(' and ')')? You can't say something such as:

    $variable =~ m//usr/local/bin/; # matches /usr/local/bin? NO! SYNTAX ERROR

    because Perl will interpret the '/' as being the end of the regular expression. There are three possible ways to match something like the above. They are listed below:

    1) You can use a backslash to 'escape' whatever special character you want to match. This includes backslashes! Hence the above becomes:

    $path =~ m/\/usr\/local\/bin/;

    This tries to match /usr/local/bin in $path.

    2) You can use a different regular expression character. Using backslashes gets ugly fast if you have a lot of special characters to match (path characters are especially bad, as you can see above).

    Fortunately, Perl has a form of syntactic sugar which helps out quite a bit here.

    Since you need to backslash every '/' when you say m// or s///, Perl allows you to change the regular expression delimiter ('/') into any character that you would like. For example, we can use a double quote (") to avoid lots of backslashing:

    $variable =~ m"/usr/local/bin"; # Note the quotation marks.

    $variable =~ m"\"help\""; # if you are going to match quotation

    # marks, you need to backslash them here. (as per \")

    $variable =~ s"$variable"$variable2"; # works in s/// too.

    We have already used this convention earlier in the book, and for good reason. If you start using " as your regular expression delimiter, it serves as a very good mnemonic to remember that what you are dealing with here is actually string interpolation in disguise. Likewise, quotation marks are a lot less frequent than slashes.

    Perl also allows you to use '{ }' or '( )' or [ ] to write regular expressions:

    $variable =~ m{ this works good with vi or emacs because the parens bounce };

    $variable =~ m( this also works good );

    $variable =~ s{ substitute pattern } { for this pattern }sg;

    This principle will come in very handy when we start dealing with multiple line regular expressions below. And since you can bounce parens here, you can start treating them as 'mini functions' if you have a reasonably intelligent editor like emacs or vi. In other words, you can bounce between the beginning and ending of the expressions.

    3) You can use the quotemeta( ) function to automatically backslash things for you. If you say something like:

    $variable =~ m"$scalar";

    then $scalar will be interpolated, turned into the value for scalar. There is a caveat here. Any special characters will be acted upon by the regular expression engine, and may cause syntax errors. Hence if scalar is:

    $scalar = "({";

    Then saying something like:

    $variable =~ m"$scalar";

    is equivalent to saying: $variable =~ m"({"; which is a runtime syntax error. If you instead say:

    $scalar = quotemeta('({');

    then $scalar becomes '\(\{' for you, and the match is equivalent to:

    $variable =~ m"\(\{";

    Then, you will match the string '({' as you would like.
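    The three approaches can be compared side by side in a small sketch. The sample path and strings are invented; \Q...\E is Perl's in-pattern shorthand for the same escaping that quotemeta() performs:

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';

# 1) Backslash each slash, or 2) pick a delimiter that needs no escaping.
my $with_slashes = ($path =~ m/\/usr\/local\/bin/) ? 1 : 0;
my $with_quotes  = ($path =~ m"/usr/local/bin")    ? 1 : 0;
my $with_braces  = ($path =~ m{/usr/local/bin})    ? 1 : 0;

# 3) quotemeta() backslashes the special characters in a string for you.
my $scalar   = '({';
my $variable = 'an opening ({ pair';
my $escaped  = quotemeta($scalar);
my $found    = ($variable =~ m"$escaped") ? 1 : 0;

# \Q...\E applies the same escaping inside the pattern itself.
my $found_inline = ($variable =~ m"\Q$scalar\E") ? 1 : 0;

print "$with_slashes $with_quotes $with_braces\n";
print "$escaped $found $found_inline\n";
```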

    Principle 5: A regular expression creates two things in the process of being evaluated: result status and backreferences

    Every single time you evaluate a regular expression, you get two things:

  • an indication of how many times the regular expression matched your string (result status)

  • a series of variables called backreferences if you wish to save parts of the match

    Let's go over each one of these in turn.

    Result Status

    As said, a result status is an indication of how many times a given regular expression matched your string. The way you get a result status is to evaluate the regular expression in SCALAR context. All the following examples use this result variable.

    $pattern = 'simple always simple';

    $result = ($pattern =~ m"simple");

    Here, result becomes one, since the pattern simple is in 'simple always simple'. Likewise, given 'simple always simple':

    $result = ($pattern =~ m"complex");

    makes result the empty string (which is false), since 'complex' isn't a substring inside 'simple always simple', and

    $result = ($pattern =~ s"simple"complex");

    makes result equal to 1 since the substitution from simple to complex works. Going further:

    $pattern = 'simple simple';

    $result = ($pattern =~ s"simple"complex"g);

    gets a little more complicated. Here, $result becomes 2, since there are two occurrences of 'simple' in 'simple simple', and the 'g' modifier to regular expressions is used, which means 'match as many times as you can'. (See modifiers below for more detail.) Likewise,

    $pattern = 'simpler still';

    if ($pattern =~ m"simple")

    {

    print "MATCHED!\n";

    }

    uses $pattern =~ m"simple" in an if clause, which basically tells Perl to print out 'MATCHED!' if the pattern $pattern contains the substring 'simple'.
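    The result statuses described above can be checked with a short sketch (the variable names $hit, $miss and $count are ours):

```perl
use strict;
use warnings;

my $pattern = 'simple always simple';

my $hit  = ($pattern =~ m"simple");    # 1
my $miss = ($pattern =~ m"complex");   # empty string, which is false

# With g, s/// returns the number of replacements it made.
my $count = ($pattern =~ s"simple"complex"g);

printf "hit=%s miss='%s' count=%d now='%s'\n", $hit, $miss, $count, $pattern;
```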

    Backreferences

    Backreferences are a bit more complicated. Suppose you want to save some of your matches for later use. To facilitate this, Perl has an operator (the parentheses '()' ) which you can put around a given set of symbols that you wish to match.

    Putting parentheses around a pattern inside a regular expression simply tells the interpreter 'hey, I wish to save that data'.

    The Perl interpreter obliges, and then saves the match that it finds in a special set of variables ($1, $2,$3.... $65536), which can be used to refer to the first, second, third, etc. parenthesis matches. These variables can then be accessed either by looking at the relevant variable, or by evaluating the regular expression in ARRAY context.

    Example:

    $text = "this matches 'THIS' not 'THAT'";

    $text =~ m"('TH..')";

    print "$1\n";

     

    Here, the characters 'THIS' are printed out - Perl has saved them for you in $1 which later gets printed.

    However there are more things that this example shows:

    1) wildcards (the character dot ('.') matches any character). If 'THIS' wasn't in the string, the pattern ('TH..') would have happily matched 'THAT'.

    2) that regular expressions match the first occurrence on a line. 'THIS' was matched because it came first. And, with the default regexp behavior, 'THIS' will always be the first string to be matched. (you can change this default by modifiers - see below)

    Figure 9.3 shows more about how this is working:

    Figure 9.3: Simple backreferences

    where each set of parentheses goes along with its own numeric variable.

    Here are some more examples:

    $text = 'This is an example of backreferences';

    ($example, $backreferences) = ($text =~ m"(example).*(backreferences)");

    Again, here we use a wildcard to separate two text strings 'example', and 'backreferences'. These are saved in $1, and $2, which are then immediately assigned to $example and $backreferences. This is illustrated in Figure 9.4:

    Figure 9.4: Direct assignment of references

    Notice, however, that this only occurs when the text string matches. When the text string does not match, then $example and $backreferences would be empty. Here is pretty much the same example, wrapped in an if statement which prints out $1 and $2 only if they match:

    if ($text =~ m"(example).*(back)")

    {

    print $1; # prints 'example' -- since the first parens matches the text example.

    print $2; # prints 'back' -- since the second parens matches the text back

    }

    So, what happens if your regular expression does not match at all? If you take the following pattern:

    $text = 'This is an example of backreferences';

    $text =~ s"(exemplar).*(back)"doesn't work";

    print $1;

    $1 will not get assigned since the regular expression didn't work. More important, Perl won't tell you that it hasn't assigned $1 to anything. This last example shows two important points about regular expressions:

    1) A regular expression is an 'all or nothing' deal. Just because the string 'back' matches inside the pattern

    'This is an example of backreferences'

    does not mean that the entire expression set matches. Since 'exemplar' is not in this string, the substitution fails.

    2) Backreferences do not get set if a regular expression fails. You can't be sure what this is going to print out! This is the cause for much consternation and is a frequent Perl 'gotcha' when tracking down a logic problem. $1 is simply a regular variable, and (contrary to some Perl myths out there) does not get set to 'blank' if the regular expression fails. Some people think this a bug, others a feature.

    Nonetheless, this second point becomes painfully obvious when you consider the following code.

    1 $a = 'bedbugs bite';

    2 $a =~ m"(bedbug)"; # sets $1 to be bedbug.

    3

    4 $b = 'this is nasty';

    5 $b =~ m"(nasti)"; # does NOT set $1 (nasti is not in 'this is nasty').

    6 # BUT $1 is still set to bedbug!

    7 print $1; # prints 'bedbug'.

    In this case, $1 is the string 'bedbug', since the match in line 5 failed! If you were expecting 'nasti', well, that is your problem. This Perlish behavior can cause hours of bloodshot eyes and lost sleep. So, consider yourself warned.

    Common Constructions for Using Backreferences:

    Or more to the point, use the following rules.

    If you want to avoid this very common bug (in which you expect a match, but do not get one and end up using a previous match instead), simply use one of the following three constructions in assigning backreferences to variables.

    1) the short circuiting method. Check for the match, and if it occurs, then and only then assign using '&&'. Example:

    ($scalarName =~ m"(regular expression)") && ($match = $1);

    2) if clause. Put your match in the condition of an if clause; then and only then, if the match succeeds, will the backreference be assigned.

    if ($scalarName =~ m"(nasti)") { $matched = $1; }

    else { print "$scalarName didn't match"; }

    3) direct assignment: Since you can assign a regular expression directly to an array, take advantage of this all the time:

    ($match1, $match2) = ($scalarName =~ m"(regexp1).*(regexp2)");

    All of your pattern matching code should look like one of the previous three examples. Without these forms, you are definitely coding without a seat belt. And this will save you tons of time, since you never will have this type of bug.
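    The three constructions can be run together as a sketch. The patterns are chosen by us so that the first and third fail and the second succeeds, which shows that a failed match leaves the target variable undefined instead of picking up a stale $1:

```perl
use strict;
use warnings;

my $scalarName = 'this is nasty';

# 1) Short circuit: assign only if the match succeeded.
my $match;
($scalarName =~ m"(nasti)") && ($match = $1);

# 2) if clause.
my $matched;
if ($scalarName =~ m"(nast.)") { $matched = $1; }

# 3) Direct assignment in list context; a failed match yields an empty list.
my ($first, $second) = ($scalarName =~ m"(this).*(nasti)");

print defined $match   ? "1: $match\n"   : "1: no match\n";
print defined $matched ? "2: $matched\n" : "2: no match\n";
print defined $first   ? "3: $first\n"   : "3: no match\n";
```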

    Using Backreferences in the Regular expression itself

    When you wish to use the 's" " "' operator, or in the case of some complicated patterns that are otherwise difficult to match with the 'm" "' operator, Perl provides a very helpful functionality of which you should be aware.

    This is that backreferences are available to the regular expression itself.

    In other words, if you put parentheses around a group of characters, you need not wait until the regular expression is over in order to use them. If you want to use the backreferences in the second (replacement) part of 's" " "', you use the syntax $1, $2, etc. If you want to use the backreferences in 'm" "' or the first (pattern) part of the 's" " "', you use the syntax \1, \2, etc. Here are some examples:

    $string = 'far out';

    $string =~ s"(far) (out)"$2 $1"; # This makes string 'out far'.

    We simply switch the words here, from 'far out' to 'out far'.

    $string = 'sample examples';

    if ($string =~ m"(amp..) ex\1") { print "MATCHES!\n"; }

    This example is a bit more complicated. The first pattern (amp..) matches the string 'ample'. This means that the whole pattern becomes the string 'ample example', where the second 'ample' corresponds to the '\1'. Hence, this matches 'sample examples'.

    Below is a more complicated example in the same vein.

    $string = 'bballball';

    $string =~ s"(b)\1(a..)\1\2"$1$2";

    Let's look at this example in detail. This does match, but it isn't obvious as to why this matches. There are 5 steps to the match of this string:

    1) The first b in parentheses matches the beginning of the string, and is saved into \1 and $1.

    2) \1 then matches the second 'b' in the string, because it is equal to b, and the second character so happens to be 'b'.

    3) (a..) matches the string 'all' and is stored into \2 and $2.

    4) \1 matches the next 'b'

    5) \2, since it is equal to 'all', matches the next and last three characters (all).

    Put it all together and you get the regular expression matching 'bballball', or the whole string. Since $1 equals 'b' and $2 equals 'all', the whole expression:

    $string = 'bballball';

    $string =~ s"(b)\1(a..)\1\2"$1$2";

    translates (in this case) into

    $string =~ s"(b)b(all)ball"ball";

    Or, in the vernacular, replace the string 'bballball' with 'ball'.

    The regular expression looks pretty much like it does in Figure 9.5:

    Figure 9.5: Complicated backreferences in s" " "

    If you understand the last example, you are pretty far along the way in understanding how Perl's regular expressions work. (They can and do get worse though!)
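    If you want to confirm the walkthrough for yourself, the following sketch runs both substitutions and saves the backreferences right after the successful match (the names $one, $two and $swapped are ours):

```perl
use strict;
use warnings;

# Swap two words using $1 and $2 in the replacement.
(my $swapped = 'far out') =~ s"(far) (out)"$2 $1";
print "$swapped\n";

# \1 and \2 are used inside the pattern, $1 and $2 in the replacement.
my $string = 'bballball';
my ($one, $two);
if ($string =~ s"(b)\1(a..)\1\2"$1$2") {
    ($one, $two) = ($1, $2);
}
print "$string ($one, $two)\n";
```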

    Nested Backreferences

    Nested backreferences are nice when you want to match strings which are too complicated to be matched 'in a single order' or one string following after the other. For example, the following expression:

    m"((aaa)*)";

    uses the * to match multiple occurrences of 'aaa': it matches '' or 'aaa', or 'aaaaaa', or 'aaaaaaaaa'. In other words, Perl matches runs of a's whose length is a multiple of 3. But it will NOT match 'aa' (against 'aa' the expression succeeds only by matching the empty string). Suppose you want to do something like match the string:

    $string = 'softly slowly surely subtly';

    Nested parens are used and the following regular expression will match:

    $string =~ m"((s...ly\s*)*)"; # note nested parens.

    Here, the outermost (( )) parentheses captures the whole thing: 'softly slowly surely subtly'. The innermost (()) parentheses captures a combination of strings beginning with an 's' and ending with a "ly" followed by spaces. Hence, it first captures 'softly', throws it away then captures 'slowly', throws it away then captures 'surely', then captures 'subtly'.

    Note that there is a problem here. What order do the backreferences come out in? You can get caught quite easily in this problem. Does the outer parentheses (( )) come first or the inner parentheses? The simple answer is to remember these three rules:

    1. the earlier a backreference is in an expression, the lower its backreference number. As in:

    $var =~ m"(a)(b)";

    In this case, the backreference (a) becomes $1, and (b) becomes $2.

    2. the more general a backreference is, the lower its backreference number. As in:

    $var =~ m"(c(a(b)*)*)";

    In this case, the expression with everything in it, (c(a(b)*)*), becomes '$1'. The expression with the 'a' nested inside it, (a(b)*), becomes '$2'. The expression with the 'b' nested inside it, (b), becomes '$3'.

    3. in case of conflicts between the two rules, rule #1 wins out. In the statement $var =~ m"(a)(b(c))", (a) becomes $1, (b(c)) becomes $2, and (c) becomes $3.

    Hence, in this case, the outer ((s...ly\s*)*) becomes $1, and the inner (s...ly\s*) becomes $2.

    Be sure to note that there is a second problem. Let's go back to our original, complicated, regular expression:

    $string = 'softly slowly surely subtly';

    $string =~ m"((s...ly\s*)*)"; # note nested parens.

    What does '(s...ly\s*)*' match? It matches multiple things; first 'softly ', then 'slowly ', then 'surely ', and finally 'subtly'. Since it matches multiples, Perl throws away the earlier matches, and $2 becomes 'subtly'.

    Even with these rules, nested parentheses can still be confusing. The best thing to do in this case is simply practice. Once again, do many regular expressions with various combinations of this logic, and then present these to the Perl interpreter. This allows you to see what order the backreferences are resolved.
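    As one such practice run, this sketch captures the groups in list context so that their order can be inspected directly (the names $outer, $inner and @groups are ours):

```perl
use strict;
use warnings;

my $string = 'softly slowly surely subtly';

# $1 is the outer group; $2 keeps only the last repetition of the inner group.
my ($outer, $inner) = ($string =~ m"((s...ly\s*)*)");
print "outer: '$outer'\n";
print "inner: '$inner'\n";

# Groups are numbered by the position of their opening parenthesis.
my @groups = ('abc' =~ m"(a)(b(c))");
print "@groups\n";
```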

    Principle 6: The heart of the power of regular expressions lies in the wildcard and multiple match operator.

    The wildcard operator lets you match more than one character in a string. If you are dealing with binary data, wildcard matches a range of characters. The multiple match operator lets you match zero, one, or many characters.

    The examples that we have looked at so far were instructive as far as teaching the basics of Perl, but they weren't very powerful. In fact, you could probably write a C subroutine to do any of the above. The Perl regular expression set derives its power from the ability to match multiple patterns of text, that is the ability to represent many distinct patterns of data by the logical 'shorthand' mentioned above. Perl just so happens to have the best shorthand available.

    Wildcards

    Wildcards represent classes of characters. Suppose you had the following strings, but you didn't know if they were capitalized or not:

  • kumquat

  • Kristina

  • Kentucky

  • key

  • keeping

    In this case, the following Perl expression would match the first letter of each word:

    [Kk]

    This is an example of a character class. All wildcards in Perl can be represented by taking a '[', putting the class of characters you wish to match between the brackets, and then closing with a ']'. The above wildcard tells the regular expression engine 'OK -- I'm looking for either a "K" or a "k" here. I'll match if I find either one'. Here are some more examples of wildcards in action:

    $scalarName = 'this has a digit (1) in it';

    $scalarName =~ m"[0-9]"; # This matches any character between 0 and 9, that is, matches any digit.

    $scalarName = 'this has a capital letter (A) in it';

    $scalarName =~ m"[A-Z]"; # This matches any capital letter (A-Z).

    $scalarName = "this does not match, since the letter after the string 'AN ' is an A";

    $scalarName =~ m"an [^A]";

    The first two examples are fairly straightforward. '[0-9]' matches the digit '1' in 'this has a digit (1) in it'. '[A-Z]' matches the capital 'A' in 'this has a capital letter (A) in it'. The last example is a little bit trickier. Since there is only one lowercase 'an ' in the string, the only characters that can possibly match are the last four, 'an A'.

    However, by asking for the pattern 'an [^A]' we have distinctly told the regular expression to match 'a', then 'n', then a space, and finally a character that is NOT an 'A'. Hence, this does not match. If the string were 'match an A not an e', then this would match, since the first 'an' would be skipped, and the second matched! Likewise:

    $scalarName = "This has a tab(\t)or a newline in it so it matches";

    $scalarName =~ m"[\t\n]"; # Matches either a tab or a newline.

    # matches since the tab is present.

    This example illustrates some of the fun things that can be done with matching and wildcarding. First, the same characters that you can have interpolated in a " " string also get interpolated both in a regular expression and inside a character class denoted by brackets ([\t\n]). Here, "\t" becomes the matching of a tab, and "\n" becomes the matching of a newline.

    Second, if you put a '^' inside and at the front of your [ ], the wildcard matches the negation of the characters in the grouping. Likewise, if you put a '-' inside the [ ], the wildcard matches the range that you give (in this case, all digits ([0-9]), all capitals [A-Z]). These operators can be combined to get wildcards that are fairly specific:

    $a =~ m"[a-fh-z]"; # matches any lower case letter *except* g.

    $a =~ m"[^0-9a-zA-Z]"; # matches any non-alphanumeric character (ie: NOT

    # a character in 0-9, a-z or A-Z)

    $a =~ m"[0-9^A-Za-z]"; # a mistake. Does not

    # equal the above. Instead matches 0-9,

    # A-Z, a-z, OR A CARET (^).

    $a =~ m"[\t\n ]"; # matches a space character (tab, newline or blank).

    The important thing to note here is the third example. The caret in '[0-9^A-Za-z]' is a literal caret, not a negative, since it appears in the middle of a character class. Hence, if you want a negative character class, always put it at the beginning of the []. Don't forget the [] either: if you do, then you've got a literal string of text, not a character class.
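    The difference between a leading caret and a caret in the middle of a class is easy to test. This is a minimal sketch; the four sample characters and the hash names are assumptions picked to show each case.

```perl
use strict;
use warnings;

my (%negated, %literal);
for my $char ('a', '7', '^', '-') {
    # Leading caret: matches anything that is NOT a letter or digit.
    $negated{$char} = $char =~ m"[^0-9a-zA-Z]" ? 'yes' : 'no';
    # Caret in the middle: a literal caret, alongside letters and digits.
    $literal{$char} = $char =~ m"[0-9^A-Za-z]" ? 'yes' : 'no';
    printf "%s  negated: %-3s  literal caret: %s\n",
        $char, $negated{$char}, $literal{$char};
}
```

    The letter and the digit match only the second class, the hyphen matches only the first, and the caret matches both: it is a non-alphanumeric character, and it is also listed literally in the second class.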

    Common Wildcards

    It so happens that certain wildcards are extremely common, and that you probably don't want to have to say something such as [0-9] every single time you want to match a digit. In that case, Perl has several convenient short hand wildcards which make things easy on the programmer. Here they are, along with an English expression describing what they represent, and the character grouping that they are equivalent to:

  • \d -- matches a digit (character grouping [0-9])

  • \D -- matches a non-digit (character grouping [^0-9])

  • \w -- matches a word character (character grouping [a-zA-Z0-9_]; underscore is counted as a word character here)

  • \W -- matches a non-word character (character grouping [^a-zA-Z0-9_])

  • \s -- matches a 'space' character (character grouping [\t\n ]: tab, newline, space)

  • \S -- matches a 'non-space' character (character grouping [^\t\n ])

  • . -- matches any character except, in most cases, newline (character grouping [^\n]). It matches any character at all, newline included, when you say m"(.*)"s. See modifiers, below.

  • $ -- although not really a wildcard (it doesn't match any specific character), it is a widely used special character, which matches the 'end of line' if placed at the end of a regular expression. Zero Width Assertion.

  • ^ -- although not really a wildcard either, a special character that matches the 'beginning of line' if placed at the beginning of a regular expression. Zero Width Assertion.

  • \b, \B -- like '$' and '^', these don't match a character, but match a word boundary (\b) or the lack of a word boundary (\B). Zero Width Assertion.
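    To see the shorthand wildcards at work, here is a short sketch. The sample text and variable names are made up for illustration, and the 'g' modifier (covered later in the chapter) is used only to collect every match into a list.

```perl
use strict;
use warnings;

my $text = "Room 42\tis open";

my @digits      = $text =~ m"(\d)"g;    # \d: each digit
my @words       = $text =~ m"(\w+)"g;   # \w: runs of word characters
my @spaces      = $text =~ m"(\s)"g;    # \s: blank, tab, blank
my ($nondigits) = $text =~ m"(\D+)";    # \D: first run of non-digits

print "digits: @digits\n";                      # 4 2
print "words:  @words\n";                       # Room 42 is open
print "spaces: ", scalar(@spaces), "\n";        # 3
print "nondigits: [$nondigits]\n";              # [Room ]
```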

    The first point that we can see from this table is the wildcard ('.'). This gets used fairly often as filler between items, along with the multiple match operators mentioned above. Take a look at the following match:

    $a = 'Now is the time for all good men to come to the aid of their party';

    $a =~ m"(Now).*(party)"; # matches, since '.' matches any

    # character except newline

    # and '*' means match zero or more characters.

    What is going on here is that the '.*' gobbles up all the characters between 'Now' and 'party', and the match is successful. ( All in this context means 'zero or more, as many as possible'. This is called greediness and we shall talk about this when we talk about multiple match operators below.)

    Here are some other examples of wildcards. Note that we use single quoted strings on the left side of the '=~' (this is a simple way to test expressions):

    1 '1956.23' =~ m"(\d+)\.(\d+)"; # $1 = '1956', $2 = '23'

    2 '333e+12' =~ m"(\D+)"; # $1 = 'e+'

    3 '$hash{$value}' =~ m"\$(\w+)\{\$(\w+)\}"; # $1 = 'hash', $2 = 'value'

    4 '$hash{$value}' =~ m"\$(\w+)\{(\W)*(\w+)(\W*)\}"; # $1 = 'hash', $2 = '$',

    # $3 = 'value', $4 = ''

    5 'VARIABLE =VALUE' =~ m"(\w+)(\s*)=(\s*)(\w+)"; # $1 = 'VARIABLE', $2 = ' ',

    # $3 = '', $4 = 'VALUE'

    6 'catch as catch can' =~ m"^(.*)can$"; # $1 = 'catch as catch ' (note the trailing space)

    7 'can as catch catch' =~ m"^can(.*)$"; # $1 = ' as catch catch' (note the leading space)

    8 'word_with_underlines word2' =~ m"\b(\w+)\b"; # $1 = 'word_with_underlines'

    Each case shows a different wildcard, using '*' to mean 'match zero or more wildcards in a row', and '+' to mean 'match one or more wildcards in a row'. Some of these examples are useful in themselves. Example 5 shows a good way to fortify expressions against errant spaces by using \s*. Example 8 shows a generalized way to match a word. Example 4 shows a relatively general way to match a hash with a key.

    However, in particular, Example 1 isn't a generalized way to match a Perl number. This is an exceedingly difficult problem, given all of the formats that Perl supports, and we shall consider it as a problem later on.

    There is another thing to notice from this table: some of the wildcards are labeled 'Zero Width Assertions'. We shall turn to this next.

    Zero Width Assertions and Positive Width Assertions

    The following characters are what you might call positive width assertions (Table 9.2):

    Table 9.2 (positive assertions):

    \D (non-digit),

    \d (digit),

    \w (word),

    \W (non-word)

    \s (space)

    \S (non-space)

    '.' (anything but newline)

     

    These actually 'match' a character in the string. Positive width means they match a character, and the regular expression engine 'eats' it in the process of matching. The other characters are zero width assertions (Table 9.3):

    Table 9.3 (zero width assertions):

    ^ (beginning of the string),

    $ (ending of string)

    \b (word boundary)

    \B (non-word boundary)

     

    These don't match a character, they match a condition. In other words, '^cat' will match a string with 'cat' at the beginning of it, but the '^' itself doesn't match any given character. Take a look at the following expressions:

    $ziggurautString = 'this matches the word zigguraut';

    $ziggurautString =~ m"\bzigguraut\b";

    $ziggurautString =~ m"\Wzigguraut\W";

    The first one matches, because it looks for 'zigguraut' between two word boundaries. The string satisfies this condition: there is a space before the word, and the end of the string after it.

    The second one does not match.

    But why doesn't the second one match? The '\W' on the end is a positive width assertion and therefore has to match a character. And the end of a line is not a character, it is a condition. This is an important distinction.

    Furthermore, even if it matched, the second one would 'eat' the characters involved. Hence, if you said something like:

    $ziggurautString = 'This matches the word zigguraut now';

    $ziggurautString =~ s"\Wzigguraut\W""g;

    This would end up with 'This matches the wordnow', since you have substituted both the word and the spaces on either side of it. Hence:

  • Zero width assertions like \b and \B can match places where there are no characters. They never 'eat' any characters in the process of a match.
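    The following sketch puts the two kinds of assertion side by side on the same data. The variable names ($eaten, $kept and so on) are only illustrative.

```perl
use strict;
use warnings;

my $text = 'This matches the word zigguraut now';

# \W must consume a real character on each side of the word ...
(my $eaten = $text) =~ s"\Wzigguraut\W""g;
# ... whereas \b only checks a condition and consumes nothing.
(my $kept  = $text) =~ s"\bzigguraut\b""g;

print "[$eaten]\n";   # both surrounding spaces are gone
print "[$kept]\n";    # both surrounding spaces are still there

# At the very end of a string there is no character left for \W.
my $tail       = 'this matches the word zigguraut';
my $positive   = $tail =~ m"\Wzigguraut\W" ? 1 : 0;
my $zero_width = $tail =~ m"\bzigguraut\b" ? 1 : 0;
print "positive width: $positive, zero width: $zero_width\n";
```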

    Here are some other wildcard matching examples:

    $example = '111119';

    $example =~ m"\d\d\d"; # match the first three digits it can find in the string. Matches '111'.

    $example = 'This is a set of words and not of numbers';

    $example =~ m"of (\w\w\w\w\w)"; # Matches 'of words'. Creates a backreference.

    Note the last example. Since there is an 'of ' earlier in the string (before 'words'), the pattern matcher matches this particular 'of ', and NOT the later 'of' (the one before 'numbers'). The last example also shows where we are going. It is a real chore to have to type out '\w' five times in order to match 5 word characters. Hence, Perl provides multiple match operators to facilitate matching long patterns. We turn to this next.

    Multiple Match Operators

    There are six multiple match operators in Perl. They are used to avoid coding stuff such as saying '\w' five times in a row, as in the above section. Think of them as a shorthand.

    The six Perl multiple match operators are:

  • * Match zero, one or many times

  • + Match one or many times

  • ? Match zero or one time

  • {X} Match exactly X times

  • {X,} Match X or more times

  • {X,Y} Match X to Y times

    These two examples are equivalent, but which do you find easier to read?

    $example = 'This is a set of words and not of numbers';

    $example =~ m"of (\w\w\w\w\w)"; # Matches 'of words'.

    $example =~ m"of (\w{5})"; # Usage of { X } form. Matches 5 characters,

    # and backreference $1 becomes the string 'words'.

    Hopefully, you find the second example easier to read. It uses multiple match operators to avoid boring, repetitive code.

    Multiple match operators can also match an indeterminate number of characters. The regular expression a* matches '', or 'a', or 'aa', or 'aaa' or any number of a's. It matches zero or many a's. This example:

    $example = 'this matches a set of words and not of numbers';

    $example =~ m"of (\w+)";

    matches the string 'of words', and $1 becomes 'words'. And

    $example =~ m"of (\w{2,3})"; # Usage of {X,Y}. $1 becomes the string 'wor'

    # (the first three letters of the first match it finds).

    matches the string 'of wor'. And contrary to intuition, the m"" clause in:

    $example = 'this matches a set of words and not of numbers';

    $example =~ m"of (\d*)";

    does match this string, even though we are looking for digits with \d*. Why? Because \d* means zero to many, and hence it matches zero digits! However,

    $example =~ m"of (\d+)";

    will not match the same string, because you have used \d+ instead of \d*. It is therefore looking for one or more digits after the word 'of', which the string does not have.
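    All of the multiple match examples in this section can be run in one go. This is a sketch using the section's own string; the variable names are illustrative.

```perl
use strict;
use warnings;

my $example = 'this matches a set of words and not of numbers';

my ($five)  = $example =~ m"of (\w{5})";     # exactly five word characters
my ($plus)  = $example =~ m"of (\w+)";       # one or more
my ($range) = $example =~ m"of (\w{2,3})";   # two to three, as many as possible
my ($star)  = $example =~ m"of (\d*)";       # zero or more digits: succeeds, empty
my $has_digits = $example =~ m"of (\d+)" ? 1 : 0;   # one or more digits: fails

print "{5}:   [$five]\n";
print "+:     [$plus]\n";
print "{2,3}: [$range]\n";
print "\\d*:   [$star]\n";
print "\\d+ matched? $has_digits\n";
```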

    Greediness

    Now, all of the examples above show a major point about how the regular expression engine matches a given string with a given expression. That is, by default, multiple match operators are greedy.

    What does greedy mean in this sense? Greedy means that Perl multiple match operators will by default gobble up the maximal amount of characters in a string and still have the ability to make a pattern match. You should learn this very well. Understanding the nature of greedy Perl expressions will save you hours of tracking down weird regular expression behavior.

    Here are a few simple examples of this greedy behavior that can drive a programmer mad. Let's start with the statement:

    $example = 'This is the best example of the greedy pattern match in Perl5';

    Now suppose you want to match the 'is' in this example. Accordingly, you do something such as:

    $example =~ m#This(.*)the#;

    print $1; # This does NOT print out the string 'is'!

    You expect to see 'is' when you print out the $1. What you get is the string:

    ' is the best example of '

    Which works as in Figure 9.6:

    Figure 9.6: Greedy .* with Caveats

    This is because of the greediness of the multiple match operator '*'. The * goes ahead and takes all the characters up to the LAST occurrence of the string 'the' (the one before 'greedy'). And, if you are not careful, you will get unexpected results from using regular expressions.

    More examples:

    $example = 'sam I am';

    $example =~ m"(.*)am"; # (.*) matches the string 'sam I '

    $example = 'RECORD: 1 VALUE: A VALUE2: B';

    $example =~ m"RECORD:(.*)VALUE"; # (.*) matches ' 1 VALUE: A '

    $example = 'RECORD';

    $example =~ m"\w{2,3}"; # matches REC

    The last example shows that even numeric multiple match operators are greedy. Even though two word characters would be enough to satisfy the pattern, Perl prefers to match three, since it can. If you said 'RE' =~ m"\w{2,3}", this would match only two characters, since that is the maximum amount possible.
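    The greedy examples above can be verified with a few lines. This sketch adds capturing parentheses around \w{2,3} so the matched text can be printed; the variable names are assumptions.

```perl
use strict;
use warnings;

my $example = 'This is the best example of the greedy pattern match in Perl5';

my ($greedy) = $example   =~ m#This(.*)the#;   # runs up to the LAST 'the'
my ($sam)    = 'sam I am' =~ m"(.*)am";        # runs up to the LAST 'am'
my ($three)  = 'RECORD'   =~ m"(\w{2,3})";     # takes three because it can
my ($two)    = 'RE'       =~ m"(\w{2,3})";     # only two are available

print "[$greedy]\n[$sam]\n[$three]\n[$two]\n";
```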

    Backtracking and multiple wildcards

    OK, you've hung in there this far, now is the time to tackle a thorny subject. As said up above, the combination of wildcards and backtracking can cause some extremely slow performance for regular expressions. If you understand why, this is a good indicator that you are 'getting' regular expressions:

    Take the following example --

    $string =~ m"has(.*)multiple(.*)wildcards";

    What this means to the regular expression engine is "OK -- I'm going to look for (in numerical order)":

    1. the pattern 'has';

    2. the maximum text I can find until I get to the last 'multiple' (the first (.*));

    3. the string 'multiple';

    4. the maximum text I can find until I get to the last 'wildcards' (the second (.*));

    5. the string 'wildcards'.

    Consider, then, what happens with the following string:

    has many multiple wildcards multiple WILDCARDS

    Well, what happens is:

    1. First, Perl matches 'has':

    has many multiple wildcards multiple WILDCARDS
    ^^^

    2. Perl does the first (.*) part and gobbles up all the characters it can find until it hits the last 'multiple':

    has many multiple wildcards multiple WILDCARDS
       ^^^^^^^^^^^^^^^^^^^^^^^^^

    3. Perl matches the string 'multiple':

    has many multiple wildcards multiple WILDCARDS
                                ^^^^^^^^

    4. Perl tries to find the string 'wildcards' and fails, reading out to the rest of the string. WILDCARDS does not match 'wildcards'!

    5. Now what does Perl do? Since there was more than one wildcard (*) character, it BACKTRACKS. The last place it could have made a mistake is in step #2, when it gobbled up ' many multiple wildcards '. Hence, it goes back to right after 'has':

    has many multiple wildcards multiple WILDCARDS
       ^goes back here

    6. Now it tries to rectify its mistake, and only gobbles up characters up to the NEXT to last incidence of 'multiple'. Hence the first (.*) matches:

    has many multiple wildcards multiple WILDCARDS
       ^^^^^^

    7. 'multiple' then matches:

    has many multiple wildcards multiple WILDCARDS
             ^^^^^^^^

    8. Then the second (.*) matches the space:

    has many multiple wildcards multiple WILDCARDS
                     ^

    9. And finally 'wildcards' matches:

    has many multiple wildcards multiple WILDCARDS
                      ^^^^^^^^^

    Therefore the whole regular expression matches 'has many multiple wildcards'. It gave the result that might be expected, but it sure took a tortuous route to get there!

    To be sure, the Perl algorithm that implements regular expressions has some shortcuts to improve the performance of the above, but the logic above is basically correct.
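    You can confirm where the engine ends up by printing the two backreferences. This is a small sketch; the extra pattern with one set of parentheses around the whole expression is an addition used only to show the full text matched.

```perl
use strict;
use warnings;

my $string = 'has many multiple wildcards multiple WILDCARDS';

# After backtracking, the first (.*) stops before the FIRST 'multiple'.
my ($first, $second) = $string =~ m"has(.*)multiple(.*)wildcards";
my ($whole)          = $string =~ m"(has.*multiple.*wildcards)";

print "first:  [$first]\n";    # ' many '
print "second: [$second]\n";   # a single space
print "whole:  [$whole]\n";
```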

    I don't hesitate to add that the above example may be the most important one in this chapter, since even Perl veterans incorrectly 'parse' regular expressions now and again (much to their chagrin). Go over it again and again, until the answer is second nature to you. After that, try to trace out the way Perl matches the following:

    $pattern = "afbgchdjafbgche";

    $pattern =~ m"a(.*)b(.*)c(.*)d";

    We'll give you this one for free:

    Matching m"a(.*)b(.*)c(.*)d" against 'afbgchdjafbgche', one element of the pattern at a time:

    a -- matches the first 'a'

    first (.*) -- takes 'fbgchdjaf'; greedy, goes to last 'b'

    b -- matches the last 'b'

    second (.*) -- matches 'g'

    c -- matches the last 'c'

    third (.*) and d -- backtrack, because no 'd' follows

    first (.*) -- takes 'f'; now we take up everything to the next to last 'b'

    b -- matches the first 'b'

    second (.*) -- takes 'gchdjafbg'; now the second .* becomes greedy

    c -- matches the last 'c'

    third (.*) and d -- still no 'd'. Darn. Backtracks

    second (.*) -- takes 'g'; the wildcard becomes less greedy, gobbles to next to last 'c'

    c -- matches the first 'c'

    third (.*) -- takes 'h'; there is only one 'd' in the string and this matches up to it

    d -- a match!

    matches 'afbgchd'.

    Not very pretty. As you can see, if you have more than one greedy multiple match operator, things can get ugly (and inefficient) fast.

    Possibly the simplest maxim that can be taken out of this example is that the multiple match operators to the left get the first say over their counterparts to the right. A pattern such as:

    m"(.*)(.*)";

    will always, in a string with no newlines in it, make the first backreference contain the whole string, and the second contain nothing. Errors like this are best dealt with by the '-Dr' command line option, as in 'perl -Dr script.p'. We shall talk about this in chapter 21, 'programming for debugging'.
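    Both results are quick to check. A minimal sketch follows; the second test string is arbitrary and the variable names are illustrative.

```perl
use strict;
use warnings;

# The exercise above: each (.*) ends up with a single character.
my @parts = 'afbgchdjafbgche' =~ m"a(.*)b(.*)c(.*)d";
print "parts: @parts\n";   # f g h

# Leftmost operator gets first say: it takes everything,
# and the second group is left with the empty string.
my ($left, $right) = 'any string at all' =~ m"(.*)(.*)";
print "left: [$left], right: [$right]\n";
```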

    Anyway, what happens if you don't want to have this greediness? Well, as we shall see in the next section, Perl (unique to any package out there) has the ability to have non-greedy versions of these characters.

    Non-Greedy Multiple Match Operators.

    Greediness can be a blessing, but it can, just as often, be such a hassle! Take a common example of C comments (so common it is a FAQ and in the documentation). Say you want to match just the first comment in the following:

    /* this is a comment */ /*another comment */

    Try to think of a greedy solution here. We want to match a '/*' followed by all the text up to and including the next '*/'. If we try:

    m"/\*.*\*/";

    Then this will match:

    /* this is a comment */ /*another comment */

    The whole thing! Again, because the * is greedy.

    Hmm. This is the best greedy solution I could come up with:

    $commentmatcher =~ m"/\*([^*]*|\**[^/*])*\*/";

    which is not the most readable of solutions! (Although, again, we could make it much more readable using the m""x, see below). We shall go over this later because understanding this particular expression will help immensely in your learning of regular expressions in general!

    For now, just welcome the non-greedy versions. There is a simple rule to remember them by:

    Simply add a '?' onto the end of any greedy multiple match operator to make it non-greedy.

    Hence, the above commentmatcher becomes:

    $commentmatcher =~ m"/\*(.*?)\*/";

    which still isn't the most readable form, but it sure is a lot better! We can even describe this in simple English: "take a '/*', then take the minimum possible amount of characters, and then take the closing '*/'." Something like Figure 9.7:

    Figure 9.7: Minimal matching

    'Laziness' is another term that is used to describe '?' The regular expression engine can be thought as lazily marching along, until it hits the first possible expression that it can match. In this case, it is a '*/' that it hits to move it onto the next step. If you say:

    $line =~ m"(.*?)(.*?)(.*?)";

    Guess what will happen. Each of the (.*?) will match nothing.

    Why? Again, the minimum amount of characters that will match here is zero, since you have said * (zero to many). Hence, by matching zero characters, each (.*?) has done its job, and being lazy, passes control onto the next (.*?) which in turn takes zero characters.
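    Here is a short sketch comparing the greedy and lazy comment matchers on the text from above. The outer parentheses are added so the whole match can be captured, and the variable names are assumptions.

```perl
use strict;
use warnings;

my $source = '/* this is a comment */ /*another comment */';

my ($greedy) = $source =~ m"(/\*.*\*/)";    # .* runs to the LAST '*/'
my ($lazy)   = $source =~ m"(/\*.*?\*/)";   # .*? stops at the FIRST '*/'

print "greedy: $greedy\n";
print "lazy:   $lazy\n";

# Three lazy groups with nothing forcing them to grow all match nothing.
my @nothing = 'some line of text' =~ m"(.*?)(.*?)(.*?)";
print "lengths: ", join(',', map { length } @nothing), "\n";   # 0,0,0
```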

    Here are the generalized rules for lazy matchers. Simply append them to a character class (like ., \d or [123]) to get this lazy behavior:

  • *? Match zero, one or many times, but match the fewest possible number of times

  • +? Match one or many times, but match the fewest possible number of times

  • ?? Match zero or one time, but match the fewest possible number of times

  • {X}? Match exactly X times (the same as {X}, since there is nothing to minimize)

  • {X,}? Match X or more times, but match the fewest possible number of times

  • {X,Y}? Match X to Y times, but match the fewest possible number of times

    Here are some more examples of minimal matchers in action, along with what would be matched by the greedy ones:

    $example = 'This is the time for all good men to come to the aid of their party';

    $example =~ m"This(.*?)the";

    Here, (.*?) matches ' is '. If it were greedy it would match ' is the time for all good men to come to the aid of ' (the last 'the' in the string is the one at the front of 'their').

    $example = '19992113333333333333331';

    if ($example =~ m"1(\d{3,}?)")

    This expression says 'match a 1 followed by three digits (or more)'. However, since there is nothing after the '?', it basically says 'match a 1 followed by three digits'. So this matches '1999'.

    $example = '1f999133333333333333331';

    if ($example =~ m"1(\d{3,}?)")

    Here, we have the same expression, but the first 1 the pattern matcher finds is disqualified, since it is followed by an 'f'. It then goes to the next 1, and matches '1333'.

    $example = '1f999133333333333333331';

    if ($example =~ m"1(\d{3,}?)1")

    Here, this matches something quite different. We have added the requirement that whatever digits we find with \d{3,}, a 1 has to follow. Hence, even though the pattern matcher is lazy, it has to go to the end of the expression to find a match.
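    The three lazy {3,}? cases can be compared directly. In this sketch the parentheses are moved to surround the whole pattern so that the complete match is captured; that change, and the variable names, are assumptions made for illustration.

```perl
use strict;
use warnings;

my $plain  = '19992113333333333333331';
my $with_f = '1f999133333333333333331';

my ($first)  = $plain  =~ m"(1\d{3,}?)";    # stops after three digits
my ($second) = $with_f =~ m"(1\d{3,}?)";    # skips the '1' followed by 'f'
my ($third)  = $with_f =~ m"(1\d{3,}?1)";   # forced on to the closing '1'

print "$first\n$second\n$third\n";
```

    The third match runs from the second '1' in the string all the way to the final '1', because that is the first place the trailing 1 in the pattern can be satisfied.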

    As you can see, one needs to be very careful with regular expression logic. Surprises abound for the uninitiated. There are endless ways for those who don't know what they are doing to shoot themselves in the foot!

    Learning the principles behind regular expressions is one big step forward. If you want more information, turn to the section 'Perl Debugging', where we shall give more information on debugging Perl regular expressions.

    Principle 7: If you want to match more than one set of characters, Perl uses a technique called alternation

    Alternation

    Alternation is the way to tell Perl that you wish to match one of two or more patterns. In other words, the expression:

    (able|baker|charlie)

    in a regular expression tells Perl "look for either the string 'able' OR the string 'baker' or the string 'charlie'." As an example, start with the following statement:

    $declaration = 'char string[80];' or $declaration = 'unsigned char string[80];'

    Now you want to match the strings 'char' or 'unsigned char'. It would be very convenient to match more than one string at a time. The following regular expression matches both:

    foreach $declaration ( 'char string[80]', 'unsigned char string[80]' )

    {

        if ( $declaration =~ m"(unsigned char |char )" )

        {

            print ":$1:"; # prints ':char :' first time around.

            # prints ':unsigned char :' second time around.

        }

    }

    The ( | ) syntax means match either 'unsigned char ' or 'char ' and save the string matched into a backreference. Alternation can be quite subtle, for there is one important thing to remember about its behavior:

  • Alternation always tries to match the first item in the parentheses. If it doesn't match at the current position in the string, the second pattern is then tried, and so on.

    This, together with the rule that the earliest possible match wins, is called leftmost matching, and it accounts for many of the bugs that people have when they get to regular expressions. Take the above example. Let's say that we switch the order of the items in the parentheses, so the example becomes:

    $declaration = 'unsigned char string[80]';

    $declaration =~ m"(char |unsigned char )";

    Does this match the string 'char '? No, it still matches 'unsigned char '. Perl tries every alternative at the first position in the string before it moves on to the second position, and although 'char ' fails at the 'u', 'unsigned char ' succeeds there. The order of the alternatives only decides the winner when more than one of them can match at the same position. Change the pattern to m"(unsigned |unsigned char )" and 'unsigned ' wins: it is first in the list, it matches, and so the wrong alternation is saved in the backreference.

    This is so common a mistake that there is a good 'bullet' point to be made here:

  • ALWAYS put the highest priority strings to match, the most specific strings, FIRST.

    If you don't do this, agony awaits. For example:

    $line =~ m"(.*|word)";

    never matches just 'word'. This is because 'word' is an instance of .*, that is, of any number of arbitrary characters. And since regular expressions match leftmost, it picks up the '.*' first. Hence:

    $line = "wordstar";

    $line =~ m"(.*|word)";

    will match 'wordstar' (i.e.: the whole thing), and not just 'word'. The '.*' matches any set of characters, and since it is first in the alternation, it always takes precedence over the 'word' part. This, however, does match 'word' in 'wordstar':

    $line =~ m"(word|.*)";

    since 'word' is first.
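    A short sketch shows how the order of the alternatives decides the result when both can match at the same position. The 'character' example and the variable names are additions for illustration.

```perl
use strict;
use warnings;

my ($any_first)  = 'wordstar' =~ m"(.*|word)";   # .* is tried first and wins
my ($word_first) = 'wordstar' =~ m"(word|.*)";   # 'word' is tried first and wins

# Same effect with two literal strings that share a prefix.
my ($short) = 'character' =~ m"(char|character)";
my ($long)  = 'character' =~ m"(character|char)";

print "$any_first\n$word_first\n$short\n$long\n";
```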

    Alternation is also helpful when you don't know whether a word will be followed by a delimiter or by the end of the line, or whether or not a word is plural, as in:

    $line = 'words'; # or $line = 'word';

    $line =~ m"word(s|$)"; # 'word' may be followed by an 's' or by the end of the string.

    Both of the above match. This syntax will match the string 'word' if it is followed either by the end of the string or by an 's'. Replace the '$' by '\b':

    $line =~ m"word(s|\b)";

    And you get a good way for dealing with plurals.
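    The two plural patterns can be compared on a few inputs. This is a sketch; the four sample strings and the hash names are assumptions chosen to show where the two patterns differ.

```perl
use strict;
use warnings;

my (%at_end, %at_boundary);
for my $line ('word', 'words', 'wordy', 'word list') {
    $at_end{$line}      = $line =~ m"word(s|$)"  ? 1 : 0;   # 's' or end of string
    $at_boundary{$line} = $line =~ m"word(s|\b)" ? 1 : 0;   # 's' or word boundary
    printf "%-10s (s|\$): %d  (s|\\b): %d\n",
        $line, $at_end{$line}, $at_boundary{$line};
}
```

    Only 'word list' separates the two: nothing ends the string after 'word', but there is a word boundary there.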

    Principle 8: Perl provides extensions to regular expressions, in the (?..) syntax.

    One day in Perl history (around the transition from Perl4 to Perl5), it was decided that in order for the regular expression set to grow, Perl had to 'get over the metacharacter standard'. Some people argued that there were too many metacharacters, and some people disagreed, until it was noticed that there weren't many metacharacters left on the keyboard!

    It was then decided that it would be a good idea to make one distinctive construct that could be used to provide for several more expansions to come. The keyboard was looked at, and it was found that one rather common character ('?') was hardly used anywhere. Hence, it was decided on, and the syntax looks like this:

    (?<special character(s)><text>)

    Here, <special character(s)> represent the extensions, and <text> is the text that that expression acts on. The four most common extensions are shown in Table 9.5:

    Table 9.5 (regular expression extensions)

    Extension        Meaning

    (?=<regexp>)     matches the next group of text, but doesn't 'eat' it for further matches.

    (?!<regexp>)     only match if not followed by <regexp>.

    (?:<regexp>)     grouping, but non-backreference creating, parens.

    (?xims)          built in modifier to the regular expression.

     

    In addition, there is an older (?#comment) operator, which lets you embed comments into regular expressions. This is now pretty much obsolete because of the 'x' modifier (see below).

    Otherwise, the above extensions work like any other regular expression construct. You slip 'em into the regular expression itself. I.e.: if you say:

    $line =~ m"I love (?!oranges)";

    This matches 'I love figs', but not 'I love oranges', since the (?! prohibits 'oranges' from following the string 'I love '. However, note that this would match 'I love orange', or 'I love ripe oranges', since it only prohibits things that start with the string 'oranges'. You could say:

    $line =~ m"I love(?!.*orange)";

    to prohibit these strings.

    In fact, the (?! construct is probably the most misunderstood construct in the language. People expect it to do stuff like:

    $line =~ m"(?!oranges) I love";

    and have this not match 'oranges I love'. This simply doesn't work. The (?!) construct looks forward from the point where it sits in the pattern, never backward, and it matches only if the next substring is not 'oranges'. Hence, in this case, the only place it could negate the match is at the beginning of the string -- any other place it doesn't do anything. The regular expression just goes along until it reaches the space before 'I love', finds that the next seven characters 'aren't oranges', considers the requirement satisfied, and goes on to the next requirement.
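    These claims about (?! are easy to test. The sketch below runs the two patterns over four sample lines; the sample lines and hash names are assumptions.

```perl
use strict;
use warnings;

my (%strict_form, %loose_form);
for my $line ('I love figs', 'I love oranges',
              'I love orange', 'I love ripe oranges') {
    # Fails only when 'oranges' comes immediately after 'I love '.
    $strict_form{$line} = $line =~ m"I love (?!oranges)" ? 1 : 0;
    # Fails when 'orange' appears anywhere after 'I love'.
    $loose_form{$line}  = $line =~ m"I love(?!.*orange)" ? 1 : 0;
    printf "%-20s strict: %d  loose: %d\n",
        $line, $strict_form{$line}, $loose_form{$line};
}

# A lookahead placed first does not look backward, so this still matches.
my $surprise = 'oranges I love' =~ m"(?!oranges) I love" ? 1 : 0;
print "surprise: $surprise\n";
```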

    The other two expressions which are used the most are (?:...) and (?=...). (?:...) for example, makes your regular expressions more efficient by getting rid of unwanted backreferences. If you say:

    $line =~ m"(?:int|unsigned int|char)\s*(\w+)";

    to get variable names, you may not wish to save the type of the variable. Hence the (?:). It matches the type, followed by any spaces, followed by the variable name (\w+). But it saves the variable name in $1, not $2. The (?:) is ignored. This saves on time and memory, especially in large pattern matches.
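    A minimal sketch of the difference follows; the three declarations and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my @names;
for my $decl ('int count;', 'unsigned int total;', 'char name[80];') {
    # (?:...) groups the alternatives without using up a backreference,
    # so the variable name lands in $1.
    push @names, $1 if $decl =~ m"(?:int|unsigned int|char)\s*(\w+)";
}
print "names: @names\n";

# With ordinary parentheses the type takes $1 and the name moves to $2.
my ($type, $name) = 'unsigned int total;' =~ m"(int|unsigned int|char)\s*(\w+)";
print "type: $type, name: $name\n";
```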

    The other expression, (?=) is only really useful when you use it with the 'g' modifier, which we shall see below. The 'g' modifier lets you start back at the point in an expression where you just left off, without having to traverse from the beginning. For example, if you had data that looked like:

    BLOCK1 <data> BLOCK2 <data2> BLOCK3 <data3>

    and you wanted to match '<data>' first, '<data2>' second, and '<data3>' third (and last). Well, as it stands, if you said:

    $line =~ m"BLOCK\d(.*?)(BLOCK\d|$)"g;

    this would match the '<data>' on the first run, but then place the 'match pointer' in the wrong place, after the second 'BLOCK'. If you say something like:

    $line =~ m"BLOCK\d(.*?)(?=BLOCK\d)"g;

    This matches the same amount of text, since it says 'match the minimal amount of text between 'BLOCK1' and 'BLOCK2'. It just ignores 'BLOCK2' for the purposes of the next match, so that on the next call to the regular expression can match '<data2>'. Figure 9.10 summarizes up the difference between the two:

    Figure 9.8

    Difference between m"BLOCK(.*?)(?=BLOCK)"g and m"BLOCK(.*?)BLOCK"g

    With the (?=) version we can now say the following:

    $line = 'BLOCK1 <data> BLOCK2 <data2> BLOCK3 <data3>';

    while ($line =~ m"BLOCK\d(.*?)(?=BLOCK\d|$)"g)

    {

    print "$1\n";

    }

    and have this print out '<data>' then '<data2>' then '<data3>'. We shall see more of the reasoning behind this when we get to the 'g' modifier below.
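
    To see why the lookahead matters, the two patterns can be run against the same string. This is only a sketch; the array names @consumed and @peeked are invented for the comparison:

```perl
use strict;
use warnings;

my $line = 'BLOCK1 <data> BLOCK2 <data2> BLOCK3 <data3>';

# Consuming version: the trailing BLOCK\d is eaten by each match,
# so the next match has to start after it.
my @consumed;
while ($line =~ m"BLOCK\d(.*?)(BLOCK\d|$)"g) {
    push @consumed, $1;
}

# Lookahead version: the trailing BLOCK\d is checked but not consumed.
my @peeked;
while ($line =~ m"BLOCK\d(.*?)(?=BLOCK\d|$)"g) {
    push @peeked, $1;
}

print "consuming found ", scalar(@consumed), " blocks\n";
print "lookahead found ", scalar(@peeked), " blocks\n";
```

    The consuming version skips '<data2>', because the 'BLOCK2' that introduces it was already swallowed by the first match. Note that the captured text includes the spaces around each '<data>'.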

    Summary of Regular Expression Principles

    The previous section should be more than enough to get you working with regular expressions. Although we called them 'basic' regular expressions, we shall see that in various combinations, they make a formidable ally in the fight against data. (And yes, it is a fight sometimes!) The eight principles again were:

  • Principle 1: There are three different forms of regular expressions: matching (m//), substituting (s///), and translating (tr///).

  • Principle 2: Regular Expressions match only on scalars. ($scalar =~ m"a"; works, @array =~ m"a" has @array treated as a scalar, and hence probably does not work)

  • Principle 3: A regular expression matches the earliest possible match of a given pattern. By default, it only matches or replaces a given regular expression ONCE. ($a = 'string1 string2'; $a =~ s"string""; makes $a == '1 string2' )

  • Principle 4: Regular expressions can take ANY and ALL characters that double quoted strings can. ($a =~ m"$varb" interpolates the value of $varb before matching, hence, $varb = 'a', $a = 'as', $a =~ s"$varb"" makes $a equal to 's'.)

  • Principle 5: A regular expression creates two things in the process of being evaluated: result status and backreferences. if ($a =~ m"varb") tells if $a has any occurrences of the substring varb, $a =~ s"(word1) (word2)"$2 $1" 'turns around' the two words.

  • Principle 6: The heart of the power of regular expressions lies in the wildcard and multiple match operator, and how they operate. $a =~ m"\w+" matches one or more word characters, $a =~ m"\d*" matches zero or more digits.

  • Principle 7: If you want to match more than one set of characters, Perl uses a technique called alternation. If you say m"(cat|dog)" this says, match the string 'cat' or 'dog'.

  • Principle 8: Perl provides extensions to regular expressions, in the (?..) syntax.

    Whew! How to learn all of these principles? I suggest that you start out simple. If you learn that $a =~ m"ERROR" looks for the substring 'ERROR' inside $a, you've already got a lot more power than you do in a lower level language like C. We shall also give lots of practical examples below, after we talk about two important concepts: modifiers to regular expressions and contexts.

    Modifiers to Regular Expressions

    All the regular expressions in the section above had either the form:

    $a =~ m//; # m" " is synonym, as is m{ }

    or

    $a =~ s///; # s" " " is synonym, as is s { } { }

    Both of these represent the default form of regular expressions. These match or substitute once, starting from the beginning of the string.

    Suppose we do not want to 'match or substitute once'. Suppose we want to substitute ALL occurrences of 'a' for 'b' in an expression, or we want to match case-insensitively. In other words, suppose we do not want the default behavior. Fortunately, there are some helpful modifiers that we can put on the regular expressions to overload its behavior to do something other than the default. In this modified form, the regular expressions look like this:

    $a =~ m//gismxo; $a =~ s///geismxo;

    with one or more modifiers 'tacked on' to the end to alter the behavior of Perl's expressions. Let's deal with the ones that have features in common between the two ('s','m','i','x','o') and then deal with the ones that have different meanings between operators ('e','g'). We deal with these next.

    Modifiers in both Substitution and Matching

    The constructs 'm""' and 's"""' have many modifiers in common. They are listed below, in Table 9.2.

    Modifier   Meaning
    x          'readable' regular expression form.
    i          case insensitive regular expression form.
    s          treat expression as a 'single string'
    m          treat expression as multiple strings.
    o          'compile a regular expression once'

     

    All five modifiers are described in detail below.

    x: Extended readability regular expressions.

    Regular expressions can sometimes become a mess. You have seen it above, yes, but that doesn't go half as far as some of the expressions in real life. Consider the following code which, roughly, matches a subroutine in Perl:

    $line =~ m"sub\s+(\w+)\s+\{(.*?)\}\s*(?=sub)"s;

    What does this mean? Even if you are an old hand at Perl, the above expression still forces you to think quite a bit, even if it is commented. The lack of white space in particular is irksome, and the number of special characters can give you nausea.

    The 'x' modifier is not available in Perl 4. It is a particular blessing because it makes it possible to put white space in regular expressions to make them readable, and allows room to put comments in. The expression:

    $line =~ m"sub\s+(\w+)\s+\{(.*?)\}\s*(?=sub)"s;

    becomes:

    $line =~ m{
        sub\s+ ( \w+ ) \s+   # matches the 'sub' keyword, subroutine name
                             # and matches the whitespace afterwards.
        \{                   # opening brace (escaped, so it is literal)
        ( .*? )              # matches the text of the sub. and saves it for
                             # further use.
        \}                   # closing brace
        \s*(?=sub)           # the next sub keyword
    }sx;                     # match as a multi-line string and be readable.

    While still pretty ugly as anything with that many special characters is bound to be, you can see the logic behind the ugliness much more clearly. It resembles the actual thought process of logic more closely. The braces are in their correct places. And since you can put in comments, they help immensely to give a play-by-play of what is happening.

    Note, however, a couple of caveats. Since white space is allowed in the regular expression, and filtered out, the following will not match:

    $line = "multi line string\nhere";

    $line =~ m"multi line string"x; # this does not match the above because

    # the space above GETS MUNGED OUT.

    Note that the 'x' readability function only works in the first bracket of the 'substitute' operator (i.e.: in s{ }{ }). This is because ONLY the first bracket is a regular expression. The second bracket is a replacement string: variables are still interpolated there, but every space, tab, and newline in it is taken literally. For example, the following is probably NOT going to do what you want:

    $line = 'aaaaaa'; # we want 'bbbbbb' after the substitute below.

    $line =~ s {

    a

    }

    {

    b

    }gx; # we want to do a 'global' substitution, i.e.: replace

    # ALL a's with b's. DOESN'T WORK!

    print $line; # prints ' b b... etc'

    # six times over.

    What has happened here is that the item in the second bracket does not have its whitespace munged. Instead, each instance of 'a' (where whitespace does get munged) is replaced by everything in the second bracket: the leading whitespace, a 'b', and then a newline, so you get a mess.
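
    One way around this, sketched below, is to spread out only the pattern side and keep the replacement tight against its brackets:

```perl
use strict;
use warnings;

my $line = 'aaaaaa';

$line =~ s{
            a        # whitespace and comments are ignored on the pattern side
          }{b}gx;    # the replacement is literal, so no stray whitespace here

print "$line\n";
```

    Here every 'a' becomes a plain 'b', since the second bracket contains nothing but the 'b' itself.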

    Readable regular expressions have a large role in letting you keep your sanity when dealing with more complicated things.

    i: case-insensitive matching

    Regular expressions are case-sensitive by default. Using the 'i' modifier indicates that the matching will be done case-insensitively instead.

    $pattern = 'Exercise';

    $pattern =~ s"exer"EXER"i; # matches first four characters of Exercise. (Exer)

    $pattern = 'Edward Peschko';

    $pattern =~s"[a-f]dward"Edmund"gi; # matches 'Edward'. replaces with Edmund.

    In both cases these match, the first turning 'Exercise' into 'EXERcise', the second turning 'Edward Peschko' into 'Edmund Peschko'.

    The 'i' modifier is pretty much a shorthand for writing several tedious regular expressions, such as $pattern =~ m"[Ee][Xx][Ee][Rr]";

    s: treat the pattern as a 'single line'.

    Without modifiers, a dot ('.') matches anything but a newline. Sometimes this is helpful. Sometimes it is very frustrating, especially if you have data that spans multiple lines. Consider the following case:

    $line =

    'BLOCK1:

    <text here1>

    END BLOCK

    BLOCK2:

    <text here2>

    END BLOCK';

    Now suppose you want to match the text between blocks <text here[0-9]>:

    $line =~ m{

    BLOCK(\d+)

    (.*?)

    END\ BLOCK # Note backslash. Space will be ignored otherwise

    }x;

    This does not work. Since the wildcard ('.') matches every character EXCEPT a newline, the regular expression hits a dead end when it gets to the first newline and it STOPS MATCHING RIGHT THERE.

    Sometimes, as in this case, it is helpful to have the wildcard ('.') match EVERYTHING, including the newline. This is what the 's' modifier does. (The whitespace class \s is not affected: it matches newlines as well as tabs and spaces, with or without 's'.)

    It tells Perl to not assume that the string you are working on is one line long. The above then does work with an 's' on the end of the regular expression:

    $line =~ m{

    BLOCK(\d+)

    (.*?)

    END\ BLOCK # Note backslash. Space will be ignored otherwise

    }sx;

    With the 's' on the end, this now works.

    m: treat the pattern as 'multiple lines'.

    The 'm' modifier is the counterpart of the 's' modifier. In other words, it says 'treat the string as multiple lines, rather than one line'. (The two are independent of each other, so they can be combined.)

    This makes '^' and '$' match not only at the beginning and ending of the string (respectively), but also makes ^ match just after any newline inside the string, and makes $ match just before any newline. In the example,

    $line = 'a

    b

    c';

    $line =~ m"^(.*)$"m;

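    the 'm' modifier will make the backreference '$1' become 'a'. Without it, the pattern would not match at all: the dot cannot cross the first newline, and '$' would only match at the very end of the string. (With 's' instead of 'm', '$1' would become "a\nb\nc".)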
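
    The three cases can be checked directly. The following is a small sketch; the names @lines, $plain, and $whole are chosen only for this illustration:

```perl
use strict;
use warnings;

my $line = "a\nb\nc";

# With 'm', ^ and $ anchor at every line, so 'g' can collect each line.
my @lines = ($line =~ m"^(.*)$"mg);

# Without 'm' (and without 's'), '.' cannot cross the newline and
# $ only matches at the very end, so the pattern fails on this string.
my $plain = ($line =~ m"^(.*)$") ? 1 : 0;

# With 's' alone, '.' crosses newlines and the whole string is captured.
my ($whole) = ($line =~ m"^(.*)$"s);

print "lines: @lines\n";
print "plain match succeeded? $plain\n";
```

    Combining 'm' with 'g', as in the first statement, is a common way to walk through a multi-line string one line at a time.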

    o: compile regular expression only once.

    This modifier is helpful when you have a long, long expression. Consider, when you say something like:

    $line =~ m"<very long expression>";

    where '<very long expression>' is a paragraph long, or even pages long. As it stands, Perl may have to compile this regular expression again each time it hits it (this matters when the expression contains variables, which must be interpolated first). This takes time, and if the pattern you need to match is exceedingly complicated, your regular expression will be exceedingly long.

    In Jeffrey Friedl's book, there is an expression that matches email addresses which comes out to 6598 bytes long! Without the 'o' modifier it would be sunk, but if you compile it only once, it is usable.

    However, there is one caveat you should be aware of. If you say:

    $line =~ m"$regex"o;

    you make a promise to Perl that $regex will not change. If it does, Perl will not notice your change. Hence,

    $regex = 'b';

    while ('bbbbb' =~ m"$regex"o) { $regex = 'c'; }

    is actually an infinite loop in Perl. $regex changes, but it is not reflected in the regular expression. (This doesn't, however, bind you to only one regexp per program. Each instance of an expression with 'o' is compiled before usage.)

    Modifiers specific to substitution

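    The above modifiers ('s','m','x','i','o') apply to both substitution and matching (s///, m//), but there are a couple of modifiers that behave specially with substitution. These are 'e', which applies only to s///, and 'g', which means something different for s/// than it does for m//. Both are described below.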

    g: substitute ALL of the patterns for their equivalents.

    By default, the s/// operator only substitutes the first time it sees something. If you want to substitute every single instance of something with something else, use the 'g' modifier. The next three examples are equivalent:

    $pattern = 'NUM1 NUM2 NUM3';

     

    $pattern =~ s"NUM"LETTER"g; # substitutes LETTER for NUM.

    $pattern =~ s"num"LETTER"gi; # Note -- you can stack these modifiers.

    # does exactly the same thing as the above.

    while ($pattern =~ s"NUM"LETTER") {}

    All of these make $pattern 'LETTER1 LETTER2 LETTER3'. The first does so with regard to case, the second without regard to case ('gi' modifiers), and the third does so slowly, since each pass of s"NUM"LETTER" substitutes only once, giving 'LETTER1 NUM2 NUM3' first, 'LETTER1 LETTER2 NUM3' second, and finally 'LETTER1 LETTER2 LETTER3'.

    e: evaluate the second part of the s/// as a complete 'mini-Perl program' rather than as a string.

    The 'e' modifier for s/// is pretty cool, but also very involved. You can do pretty heavy wizardry with it. We'll just mention it briefly here, with an example. Let's say that you wanted to substitute all of the letters in the following string with their corresponding ASCII number:

    $string = 'hello';

    $string =~ s{ ( \w ) } # we save the $1.

    {ord($1) . " ";}egx;

    This example turns $string into '104 101 108 108 111 ' (note the trailing space). Each character was taken in turn here and run through the 'ord' function, which turned it into a number. Needless to say, this can do pretty powerful stuff in a short amount of time. It also runs the risk of being extremely obscure.

    We suggest you use this logic as a last resort, when all of your other 'bag of tricks' has failed. Its use can sometimes hide a cleaner way of doing things. Is the above really clear? Or is this better:

    $string = turnToAscii($string);

     

    sub turnToAscii

    {

    my ($string) = @_;

    my ($return, @letters);

     

    @letters = split(//, $string);

    foreach my $letter (@letters)

    {

    $letter = ord($letter) . " " if ($letter =~ m"\w");

    }

    $return = join('', @letters);

    $return;

    }

    This latter example is explicit and easily maintainable. However, it is also over 10 times as long and also a few times slower, so a judgment call has to be made on when to use 'e'.

    Matching, and the 'g' Operator.

    The modifiers (x, i, s, m, and o) work just the same with the matching operator 'm//'. There is, however, one significant change in how the 'g' modifier works, and you will use it quite frequently.

    As was seen before, the 'g' modifier in substitution meant that every single instance of a regular expression was replaced. However, this is meaningless in the context of matching. Backreferences indicate one and only one match. Hence, Perl uses the 'g' modifier in a different way with 'm//' than it does with 's///'.

    What Perl does is to attach an iterator to the 'g' operator. When you match once with '$string =~ m" "g', Perl remembers where that match occurs. This means that you can use this to match where you left off. When Perl hits the end of the string, the iterator is reset:

    $line = "hello stranger hello friend hello sam";

    while ($line =~ m"hello (\w+)"sg)

    {

    print "$1\n";

    }

    This outputs

    stranger

    friend

    sam

    and then quits, because the inherent iterator comes to the end of the string. Note, there is a caveat here. If you are using the 'g' modifier, then ANY modification to the variable being matched via assignment causes this iterator to be reset.

    $line = "hello";

    while ($line =~ m"hello"sg)

    {

    $line = $line;

    }

    This is an infinite loop! So restrain yourself, and avoid modifying your string while you are matching it. (Make a copy instead!)
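
    The iterator can be watched directly with Perl's built-in pos() function, which reports where the last 'g' match on a variable left off. A minimal sketch (the @seen array is just for the demonstration):

```perl
use strict;
use warnings;

my $line = "hello stranger hello friend hello sam";
my @seen;

while ($line =~ m"hello (\w+)"g) {
    # pos() reports the offset just past the most recent match
    push @seen, "$1 ends at " . pos($line);
}

print "$_\n" for @seen;

# Once the match finally fails, the iterator is reset.
print "iterator reset\n" unless defined pos($line);
```

    Each pass of the loop picks up at the offset where the previous one stopped, which is why the same 'hello' is never matched twice.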

    Modifiers and Contexts

    If you were not familiar with Perl regular expressions before you started this chapter, your head is probably swimming with different modifiers, methods, special characters, and so forth (I know from experience!) Let's take a moment to look at different forms of how regular expressions are used, and then finish off this chapter with some common examples of regular expressions in use.

    This has everything to do with context. Remember, in Perl, context is king, and if you pay attention to it, you can do a lot of powerful things by just recognizing the context that different expressions are in. Let's simply 'crystallize' the forms we've seen so far, and add a couple of new ones:

    Substitution (no modifier) in scalar context

    This looks like:

    if ($string =~ s"a"b") { print "Substituted Correctly"; }

    What this does is print out 'Substituted Correctly', returning a '1' to the 'if' when the string in fact did match. It also substitutes the first instance of 'a' with 'b' at the same time.

    Substitution ('g' modifier) in scalar context.

    This returns the number of substitutions made. Using this method, if you say something like:

    ($string =~ s"a"b"g) == ($string2 =~ s"a"b"g);

    this will tell you if $string has the same number of 'a's in it as $string2, as well as doing the substitution. If you wanted, you could, of course, do this on whole files:

    use FileHandle;

    undef $/; # munge mode, do whole file.

    my $fh = new FileHandle("File1");

    my $fh2 = new FileHandle("File2");

    (($line = <$fh>) =~ s"a"b"g) == (($line2 = <$fh2>) =~ s"a"b"g);

    which counts the number of 'a's in both files, comparing them together, while again, doing the substitution.
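
    The same idea can be tried without any files. In this sketch the two strings are made-up sample data:

```perl
use strict;
use warnings;

my $string  = 'banana';
my $string2 = 'cabana';

# s///g in scalar context returns the number of substitutions made
# (or a false value if there were none).
my $count  = ($string  =~ s"a"b"g);
my $count2 = ($string2 =~ s"a"b"g);

print "same number of a's\n" if $count == $count2;
print "$string $string2\n";
```

    Saving the counts in variables first, rather than comparing the two substitutions inline, makes it easier to see what is being compared.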

    Substitution (no modifier) in Array Context and Substitution ('g' modifier) in Array Context.

    These two are boring. They do exactly the same thing as Substitutions in scalar context.

    Matching in scalar context, no modifiers.

    Here again, this is the same as substitution in a scalar context with no modifiers. If you say:

    if ($line =~ m"a") { print "Matched an a!\n"; }

    this simply checks to see if $line has an 'a' in it. If you say:

    if ($line =~ m"\b(\w+)\b" ) { print "$1\n"; }

    this checks to see if $line has any word in it, and then saves that in $1, printing it out. And:

    ($line =~ m"\b(\w+)\b") && (print "$1\n");

    is the same thing, only using short circuiting to print it out.

    Matching in array context, no modifiers

    Here, this matches the first position the regular expression can match, and simply puts the backreferences in a form that is quickly accessible. For example:

    ($variable, $equals, $value) = ($line =~ m"(\w+)\s*(=)\s*(\w+)");

    This takes the first reference (\w+) and makes it $variable, the second reference (=) and makes it $equals, and the third reference (\w+) and makes it $value.

    Matching in array context, 'g' modifier

    This takes the regular expression, applies it as many times as it can be applied, and then stuffs the results into an array that consists of all possible matches. For example:

    $line = '1.2 3.4 beta 5.66';

    @matches = ($line =~ m"(\d*\.\d+)"g);

    will make '@matches' equal to '(1.2, 3.4, 5.66)'. The 'g' modifier does the iteration, matching 1.2 first, 3.4 second, and 5.66 third. Likewise:

    use FileHandle;

    undef $/;

    my $FD = new FileHandle("file");

    @comments = (<$FD> =~ m"/\*(.*?)\*/"sg);

    will make an array of all the comments in the file read through '$FD'.
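
    Both array-context forms can be tried on plain strings. In this sketch, the 'width = 42' line and the $numbers variable are invented for the example:

```perl
use strict;
use warnings;

# No modifier: the backreferences of the first match come back as a list.
my $line = 'width = 42';
my ($variable, $equals, $value) = ($line =~ m"(\w+)\s*(=)\s*(\w+)");

# 'g' modifier: every match of the pattern comes back as a list.
my $numbers = '1.2 3.4 beta 5.66';
my @matches = ($numbers =~ m"(\d*\.\d+)"g);

print "$variable $equals $value\n";
print scalar(@matches), " numbers found: @matches\n";
```

    The parentheses around the match on the right-hand side are optional; it is the list on the left that puts the match into array context.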

    Matching in scalar context, 'g' modifier

    Finally, if you use the matching operator with the 'g' modifier in scalar context, you get a behavior that is entirely different from anything else (in the regular expression world, and even the Perl world). This is that 'iterator' behavior we talked about. If you say:

    $line = "BEGIN <data> BEGIN <data2> BEGIN <data3>";

    while ($line =~ m"BEGIN(.*?)(?=BEGIN|$)"sg)

    {

    push(@blocks, $1);

    }

    This then matches the '<data>' portions of the following text, and stuffs them into @blocks on successive iterations of while:

    BEGIN <data>(%)BEGIN <data2> BEGIN <data3>

    BEGIN <data> BEGIN <data2>(%)BEGIN <data3>

    BEGIN <data> BEGIN <data2> BEGIN <data3>

    We have indicated via a '(%)' where each of the iterations start their matching. Note the use of (?=) in this example too! It is essential to matching the correct way, since if you don't use it, the 'matcher' will get set in the wrong place.

    Examples of Regular Expressions

    Enough theory already! Keep the above information in mind, especially the 'Modifiers and Contexts' part, go get a hot cup of chocolate and relax.

    Remember, there are only 177 different characters you can type with a keyboard. Regular expressions happen to use most of them. But to get down to brass tacks, I feel that they also happen to be worth the effort.

    Below are several different examples of real-life pattern matchings, from the simple to the complicated. They show the power of regular expressions. Understand how to parse them, and you are well on your way to writing Perl on your own.

    Example 1: Words in a text file:

    undef $/;

    use FileHandle;

    my $fh = new FileHandle("$ARGV[0]");

    my @words = (<$fh> =~ m"\b(\w+)\b"g);

    This is a simple example. Here, we simply open up the file given to us by the first argument to the script, i.e.: by typing 'perl5 script.p filename'. Then in line 4, the pattern 'm"\b(\w+)\b"g' iterates over the file, getting all of the words out of it and sticking them into the array @words.

    Example 2: Words fitting a given criteria in a file

    Now let's expand on this a little. Suppose we want to get all the words in a file that start with the letter 't':

    my @words = grep(m"^t", (<$fh> =~ m"\b(\w+)\b"g));

    Or equivalently:

    foreach $word (<$fh> =~ m"\b(\w+)\b"g)

    {

    push (@words, $word) if ($word =~ m"^t");

    }

    In each case, the regular expression is in array context, with a 'g' modifier, and therefore it passes back a list of words to the context in which it was called. In the first case, this was the function grep, in the second case, a foreach loop.

    After this array is passed back, the m"^t" clause then pushes onto the @words array any and all words that start with 't'. The same technique could be used, for example, to match words in a list:

    my %words = ('gofer' => 1,'Julie' =>1,'Doc' => 1,'bartender' => 1,'captain' => 1);

    @characters = grep ($words{$_}, (<$fh> =~ m"\b(\w+)\b"g));

    Or perhaps, make your own spelling checker:

    1 use FileHandle;

    2 my (%words, @words);

    3 my $fd = new FileHandle("/usr/dict/words"); # or any dictionary

    4 grep { chop($_); $words{$_} = 1 } (@words = <$fd>);

    5 undef $/;

    6 my $fh = new FileHandle("$ARGV[0]");

    7 my @misspelled_words = grep(!$words{$_}, (<$fh> =~ m"\b(\w+)\b"g));

    Here, we simply load all of the words from '/usr/dict/words' into a hash in line 4 (we need to say '@words = <$fd>' because the grep modifies the list by chopping off a newline, and we can't directly chop input from a filehandle).

    Then, we put Perl into 'get whole file' mode (line 5) and proceed to slurp the whole file into one long string, cut it up into words, and compare it against the hash to see if any word is not present. (line 7). A lot of work for one line!
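
    To try the idea without a dictionary file, here is a self-contained sketch. The tiny @dictionary list and the $text string are stand-ins invented for this example, and chomp is used in place of chop since it removes only a trailing newline:

```perl
use strict;
use warnings;

# Stand-ins for /usr/dict/words and for the file being checked.
my @dictionary = ("the\n", "quick\n", "brown\n", "fox\n");
my $text       = "the quikc brown fox";

my %words;
foreach my $entry (@dictionary) {
    chomp(my $word = $entry);    # copy the entry, then strip its newline
    $words{$word} = 1;
}

# Split the text into words and keep those missing from the hash.
my @misspelled_words = grep { !$words{$_} } ($text =~ m"\b(\w+)\b"g);

print "misspelled: @misspelled_words\n";
```

    A real checker would probably also lowercase each word before the lookup; that step is left out here to keep the sketch close to the listing above.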

    Example 3: Times (10:00AM)

    Suppose now that you have a file of the form:

    Picked up nuts and bolts 10:00AM

    Sawed wood 11:00AM

    Sanded 12:30PM

    in which the description is followed by the time at which the deed occurred and you want to turn this into a hash that tells you what you did at what time. There are three steps here.

    1) read in the file

    2) extract out the times and events.

    3) create a hash

    With the use of regular expressions, step #2 becomes fairly straightforward. Our regular expression will look something like this:

    m"^(.*?)(<regexp for time>)\s*$"mg

    Here, matching the comment is easy. We are given that every line is a comment/time pair, and consists of a comment first, and a time second. Hence, matching the comment becomes a simple matter of putting a '.*?' right after the beginning.

    Furthermore, we use the 'm' modifier, so that '^' means 'match at the beginning of the string or just after a newline', and '$' means 'match just before a newline, or at the end of the string'.

    Now we need only to find what the regular expression for 'time' is. We could roughly think of it as this:

    m"(\d{1,2}:\d{2}\s*[AP]M)";

    The first \d{1,2}: matches '10:', the second \d{2} matches 00-59, and [AP]M matches AM or PM. This expression also so happens to match 99:99PM, but the chances of such a string occurring are fairly slight, so we deem the risk acceptable (In more 'bulletproof' cases, we would have to consider this. See 'Mastering Regular Expressions' for more detail on how to only match 0-11, or 0-23, or 0-59, i.e. 'time numbers'.)

    Anyway, we can now iterate through our file with this expression and make the hash. The code is below:

    1 use FileHandle;

    2 use strict;

    3 undef $/;

    4 my $fd = new FileHandle("$ARGV[0]");

    5 my $line = <$fd>; my %commenthash;

    6 while ($line =~ m"^(.*?)(\d{1,2}:\d{2}\s*[AP]M)\s*$"mg)

    7 {

    8 my $comment = $1; my $time = $2;

    9 $commenthash{$time} = $comment;

    10 }

    The loop in 6 through 10 does most of the work, making the hash by taking the results of the regular expression (line 8) and hashifying it (line 9). You could then do whatever you would like with the data.
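
    The same loop can be exercised on an in-memory string. This is a sketch under two assumptions: the log text is pasted in rather than read from a file, and an extra \s* is added before the time so that the saved description has no trailing space:

```perl
use strict;
use warnings;

# In-memory stand-in for the log file described above.
my $line = "Picked up nuts and bolts 10:00AM\n"
         . "Sawed wood 11:00AM\n"
         . "Sanded 12:30PM\n";

my %commenthash;
while ($line =~ m"^(.*?)\s*(\d{1,2}:\d{2}\s*[AP]M)\s*$"mg) {
    my ($comment, $time) = ($1, $2);
    $commenthash{$time} = $comment;
}

foreach my $time (sort keys %commenthash) {
    print "$time => $commenthash{$time}\n";
}
```

    Since the times are used as hash keys, two events logged at the same time would overwrite each other; a hash of arrays would be needed to keep both.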

    Example 4: HTML tags: Substituting Bold Text for Italic

    You are probably aware of these. They are things like

    <H3 FOLDED_ADD_DATE="....">culture</H3>

    or

    <TITLE>my bookmarks</TITLE>.

    In general, we cannot match all of these, with one regular expression, since they can be recursive, like:

    <DL><p>

    <A ...> </A>

    </DL><p>

    Although we could write a recursive subroutine to do so.

    No, in a simple subroutine, the best we can do is pick a list of tags to match, and then use that information to match, assuming that the tags are not recursive. This is usually a safe assumption.

    Now, the general form of a tag is something like this:

    <I> .... </I>

    or

    <A HREF = ...> </A>

    In other words, the first tag consists of a '<', then a string, followed by either nothing, or a space and other text describing the tag, plus a '>'. The tag is closed by a '</STRING>' where STRING equals the same string before.

    To match:

    <B> text </B>

    we could say:

    m"<B>(.*?)</B>"

    Assuming, of course, that this tag isn't recursive. How do we generalize this so it matches any tag? Well, lets say we wanted to match bold or italic ('B' or 'I'). And furthermore, we don't know that these strings have text after them (assume we don't know if <B description> is possible). Well, then we could use the following pattern.

    m"<(B|I)(\s.*?)?>"

    Which says to match first, either B or I, and then match (zero or 1 times) the combination of a space plus the minimum amount of characters. Why this complexity? Consider, if we say: m "<(B|I).*?>", then:

    <BODY>

    will match. Hence, it is essential that we encode the facts that:

    1) there might not be any more text after the tag name.

    2) if there is any more text, it will be a space followed by any number of characters (up to but not including the next >).

    In pictures, our expression m"<(B|I)(\s.*?)?>" works like Figure 9.9:

    Figure 9.9

    A regular expression to match HTML tags:

    If we now take (B|I) and substitute it with $patterns, where $patterns = 'B|I', we get:

    m"<($patterns)(\s.*?)?>"

    To match the first tag <B>. And to match the whole enchilada:

    m"<($patterns)(\s.*?){0,1}>.*?</\1>"sg;

    Here, \1 is whatever pattern we had found in '$patterns'. Hence, this matches the whole tag (in this case, the 'B' tag.)

    <B>hold</B>

    For our code example, let's be a little bit simple, and only consider two possible strings to match: <B> and <I>. Furthermore, in this case, let's substitute bold text for italic, and vice-versa.

    However, let's write our code in such a way that it is generalizable. The code follows now:

    1 use FileHandle; undef $/;

    2 my $fd = new FileHandle("$ARGV[0]");

    3 $line = <$fd>;

    4 my (%substituteHash) = ('B' => 'I', 'I' => 'B');

    5 my $patterns = join('|', keys(%substituteHash)); # makes B|I

    6 $line =~ s{(<) ($patterns)((\s.*?)?>) # opening tag (<B> or <I>)

    7 (.*?) # text inbetween

    8 (</) \2 (>) # closing tag (</B> or </I>)

    9 }{$1$substituteHash{$2}$3$5$6$substituteHash{$2}$7}sgx;# substitute

    It's lines 5-9 which are the killers. Let's look at them closely. The first thing that we notice is that we have parenthesized everything: (<), ($patterns), etc. This way, we ensure that everything that we save will be accounted for when we substitute back in.

    Secondly, notice the 'sgx' modifiers on the end. We need to use the 's' modifier in the case that we have something like:

    <B>

    text here

    </B>

    We need to use the 'x' or else the expression will become unreadable, and finally we need to use the 'g' so that this matches more than one time.

    Finally, notice the '$1....$7' substitute variable. This is us, plunking in the data which we have matched. In particular, the $substituteHash{$2} takes whatever tag we find, and converts from, say 'B' to 'I'.
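
    A trimmed-down variant of the same substitution can be run on a string. This sketch makes some simplifying assumptions: the input is a made-up snippet, only the tag name, the attributes, and the text are captured (so the backreference numbers differ from the listing above), and the keys are sorted so the pattern is always B|I:

```perl
use strict;
use warnings;

my $line = "<B>bold</B> and <I>italic\ntext</I>";   # made-up input

my %substituteHash = ('B' => 'I', 'I' => 'B');
my $patterns = join('|', sort keys %substituteHash);   # B|I

$line =~ s{
    < ($patterns) ((?:\s.*?)?) >    # opening tag, optional attributes
    (.*?)                           # text in between
    </ \1 >                         # matching closing tag
}{<$substituteHash{$1}$2>$3</$substituteHash{$1}>}sgx;

print "$line\n";
```

    Writing the angle brackets literally in the replacement, instead of capturing them, keeps the number of backreferences down, at the cost of being a little less mechanical than the fully parenthesized version.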

    This example shows a lot about the crafting of regular expressions. The next, and final example goes even further.

    Example 5: http, ftp tags

    You know the ones I'm talking about:

    http://members.aol.com/tlyco/KITH/index.html

    ftp://ftp.x.org

    Suppose that you've got a file of text containing them (for simplicity's sake, let's suppose that they aren't split between lines) and you want to extract them. Well, the first thing we need to do is come up with a regular expression to match these. Let's look at what we need to match:

    1) 'ftp:' or a 'http:' followed by '//'.

    2) an internet address (members.aol.com, 128.101.22.1)

    3) several, optional paths. '/tlyco/KITH/index.html'.

    This translates into something that looks like this:

    m"((?:ftp|http)://(?:\w+\.){2,5}\w*(/\S*)?)"g

    Which is outlined in Figure 9.10:

    Figure 9.10

    Matching a http, ftp tag

    This did take a little trial and error. We use (?:) so we don't get any unwanted backreferences, and the (?:ftp|http) is self explanatory, but the (?:\w+\.){2,5}\w* needs a little bit of explaining, as does the (/\S*)?.

    (?:\w+\.){2,5}\w*: Each one of the \w+\. is meant to match 'members.', or 'aol.'. However, this isn't enough to match the whole internet address. Why? Because the internet address has a trailing group of letters. (?:\w+\.){2,5} does not match 'members.aol.com'; it matches 'members.aol.'. We need to add a '\w*' to match the last bit, the 'com' part.

    (/\S*)?: In English, this is saying "match a slash followed by as many non-spaces as you can find, and do it zero or one times." It refers to the fact that you can have a trailing path on an http or ftp address, but it is not a necessity to have this. This results in the optional question mark.

    Note that this pattern is not perfect. If you have backslashed spaces in the http tag in particular, this won't work. Or, if you have a http daemon running on a different port (http://site:8080 for example) it won't work. However, it is as we say 'close enough': if you want to improve it to handle such cases, go right ahead.

    Let's now make a loop which extracts all http, and ftp tags from a given file, for example a bookmarks file:

    use FileHandle;

    undef $/;

    my $fd = new FileHandle("$ARGV[0]");

    $line = <$fd>;

    while ($line =~ m"((?:ftp|http)://(?:\w+\.){2,5}\w*(/\S*){0,1})"g)

    {

    my $tag = $1;

    chop($tag);

    push(@tags, $tag);

    }

    Here, we chop off the last character, since in a bookmarks file these tags are in double quoted strings. We shall approach this problem more directly next, when we consider matching a double quoted string.
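
    For plain text, where the addresses are not wrapped in quotes, the chop is not needed. The following sketch runs the pattern over a made-up sentence instead of a bookmarks file, and uses (?:...) for the optional path so that only $1 is set:

```perl
use strict;
use warnings;

# Made-up text standing in for the contents of a file.
my $line = 'See http://members.aol.com/tlyco/KITH/index.html '
         . 'and ftp://ftp.x.org for details.';

my @tags;
while ($line =~ m"((?:ftp|http)://(?:\w+\.){2,5}\w*(?:/\S*)?)"g) {
    push @tags, $1;
}

print "$_\n" for @tags;
```

    Since \S* stops at the first space, any punctuation that directly follows an address with a path (a closing quote or a comma, say) would be swept up as part of it.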

    Example 6: C comments and Double Quoted Strings

    There are two parts to this example: an easy one, and a difficult one. If you understand the hard one, then you're on the way to regular expression nirvana. Those who want to get more information can go to 'Mastering Regular Expressions', where these examples are fleshed out with much more detail.

    We have already given the expression for finding C comments via lazy regular expressions. It looks like this:

    $line =~ m"/\*.*?\*/"gs;

    in which we match a '/*', then the minimum number of characters up to the next '*/', and then finally the '*/' itself. The 's' modifier lets '.' match newlines, so that comments spanning several lines are found too. To get all the comments in a given C file, you could say:

    use FileHandle;
    undef $/;
    my $fh = new FileHandle("$ARGV[0]");
    @comments = (<$fh> =~ m"/\*.*?\*/"gs);

    which uses the common 'g' form to get the comments out of the text, along with 'undef $/', which makes <$fh> read all of the text from the file named by the first argument.
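
    As a quick check, here is a sketch that runs the lazy pattern over a string instead of a file. The C fragment is made up, and the 's' modifier is included so that '.' can cross the newline inside the second comment.

```perl
use strict;
use warnings;

# A hypothetical fragment of C source, standing in for the file contents.
my $source = <<'END';
/* first comment */
int x = 1; /* second,
              spanning two lines */
int y = x / 2;
END

# /g in list context returns every match; /s lets '.' match newlines.
my @comments = ($source =~ m"/\*.*?\*/"gs);

print scalar(@comments), " comments found\n";
print "[$_]\n" for @comments;
```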

    So much for the easy part. However, I also said that there was a greedy version that does the same thing:

    $line =~ m"/\*([^*]|\*+[^*/])*\*+/";

    Let's make this more readable with the 'x' modifier:

    $line =~ m{
        (/\*)                 # matches the beginning
        ([^*]|\*+[^*/])*      # matches the junk in the middle: the comment
        (\*+/)                # matches the ending
    }x;

    What in the world does this do? Exactly the same thing as '/\*.*?\*/'. The middle part eats anything that is not a star, or a run of stars followed by something that is neither a star nor a slash, so it can never run past the closing '*/'. And why would anyone want to write something like this, instead of the relatively simple '/\*.*?\*/'?

    In the case above, the only answer has to do with efficiency. When you have a lot of data, you can craft a greedy expression that runs much faster than the corresponding lazy version. However, in most cases you would probably want to write the simpler version, if only for readability and maintainability.
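
    To convince yourself that a lazy and a greedy pattern can find the same comments, you can run both over the same text. This is a sketch under assumptions: the sample text is made up, and the greedy pattern shown is one possible spelling (anything that is not a star, or stars followed by something that is neither a star nor a slash), not the only one.

```perl
use strict;
use warnings;

# Hypothetical sample with awkward comments: slashes, runs of stars, several lines.
my $source = <<'END';
/* plain */ a = b / c; /** starred **/
/* path: /usr/include
   second line */ x = y * z; /* a * b */
END

my $lazy   = qr{/\*.*?\*/}s;
my $greedy = qr{/\*(?:[^*]|\*+[^*/])*\*+/};   # assumed greedy spelling, see above

my @lazy   = ($source =~ m/$lazy/g);
my @greedy = ($source =~ m/$greedy/g);

print "lazy found   ", scalar(@lazy),   "\n";
print "greedy found ", scalar(@greedy), "\n";
print "same results\n" if "@lazy" eq "@greedy";
```

    Whether the greedy form is actually faster depends on your data and your version of Perl, so measure before you trade away the readability.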

    In other cases, however, you cannot be so simple. The one we shall consider is the double quoted string. At first, we might think that we could match a double quoted string with:

    m/".*?"/;

    But alas, that is not to be. This works fine for '"True Lies"', but it does not work for '"The Man who cried \"Uncle\""'. In this second case, we have used the same trick that Perl uses: we have escaped the special character " so it can be included in a " " string! Hence, the pattern ".*?" will match "The Man who cried \" instead of the whole thing.
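
    A small sketch shows the lazy pattern going wrong. The two titles are held in single quoted Perl strings so that the backslashes reach the regular expression untouched; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $plain   = '"True Lies"';
my $escaped = '"The Man who cried \"Uncle\""';

# Capture whatever the lazy pattern considers to be the string.
my ($got_plain)   = $plain   =~ m/(".*?")/;
my ($got_escaped) = $escaped =~ m/(".*?")/;

print "$got_plain\n";
print "$got_escaped\n";
```

    The first match is the whole title; the second stops at the first escaped quote.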

    So how do we get around this? Well, by trickery. By suiting our regular expression to the pattern at hand, and thinking, we can overcome the problem.

    Hence, let's consider some possible strings we need to match:

    1 "" (empty string)

    2 "\"" (one quote)

    3 "a" (one letter)

    4 "a\"" (one letter, then a quote)

    5 "a\"a" (one letter, then a quote, then a letter).

    6 "a\t\"" (one letter then a backslashed special character, then a backslashed quote)

    These might be what we would consider boundary conditions. If we are going to screw up anywhere, it's going to be here. So, let's start building our regular expression from the ground up, so to speak. We first match the obvious:

    m/".*?"/;

    Which strings does this match? It matches 1 and 3.

    Then, let's add an alternation so that 2 can match:

    m/"(\\"|.*?)*"/;

    But this cannot be relied on for strings like #4. Nothing stops the dot from matching a backslash, so in "a\"" the .*? is free to eat both the a and the backslash. Once the backslash is eaten, \\" can no longer match the escaped quote; the closing " of the pattern matches it instead, and we are left with a trailing ". Hence, let's reconsider '.*?' here:

    m/"(\\"|[^\\"]*)*"/;

    This eats up all the characters; however, it doesn't quite work on example 6. The '\t' doesn't match the first alternative, and a backslash isn't allowed by the second alternative at all. Hence the regular expression stops right after the a in '"a\t\""'. Therefore, we need to allow for this:

    m/"(\\.|[^\\"]*)*"/;
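
    The progression can be checked mechanically. The sketch below runs three of the patterns against the six boundary strings and counts a case as passed only when the match covers the entire string. The '.*?' alternation is left out, and the labels and the harness itself are made up for the illustration.

```perl
use strict;
use warnings;

# The six boundary strings, single quoted so that every backslash is literal.
my @cases = ('""', '"\""', '"a"', '"a\""', '"a\"a"', '"a\t\""');

# Three of the patterns tried in the text, in order.
my @patterns = (
    [ 'lazy dot'      => qr/".*?"/ ],
    [ 'escaped quote' => qr/"(\\"|[^\\"]*)*"/ ],
    [ 'any escape'    => qr/"(\\.|[^\\"]*)*"/ ],
);

my %passes;
for my $p (@patterns) {
    my ($name, $re) = @$p;
    # A case passes only if the text matched ($&) is the whole string.
    my @ok = grep { $cases[$_ - 1] =~ $re && $& eq $cases[$_ - 1] } 1 .. @cases;
    $passes{$name} = "@ok";
    printf "%-14s matches cases: %s\n", $name, $passes{$name};
}
```

    The strings here are tiny, so the nested stars do no harm yet.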

    This now works on all six cases. However, we have run straight into a trap. What is this trap? Remember backtracking, in which the regular expression engine tries every single possibility, and gives up only when all of them have failed? Well... this expression goes through an inordinate amount of work before it can give up. Even a small string such as:

    "bbb

    goes through this much before it fails to match, even if we ignore the first alternation:

    "bbb (m/"([^\\"]*)*"/); # matches parens.

    "bbb (m/"([^\\"]*)*"/); # inner bracket matches all 3 characters.

    "bbb (m/"([^\\"]*)*"/); # outer bracket matches on one copy of inner three characters, fails on "

    "bbb (m/"([^\\"]*)*"/); # Inner matches 2 chars

    "bbb (m/"([^\\"]*)*"/); # Outer matches once on the two chars fails "

    "bbb (m/"([^\\"]*)*"/); # Inner matches once on last char

    "bbb (m/"([^\\"]*)*"/); # Outer matches once on last char, fails on "

    "bbb (m/"([^\\"]*)*"/); # Inner matches once on first char

    "bbb (m/"([^\\"]*)*"/); # Outer matches 3 times/ first char,fails on "

    Even in this case, with four characters in the string, we have a total of 9 tried matches, 12 if you count the end parentheses, 24 if you count the alternation which we ignored! In other words, the two stars 'battle it out', where the inner one tries to match as many characters as it can, and the outer one then drives the inner one to try more and more combinations before finally quitting.

    When the string being matched is a long one, this can literally take lifetimes. How do you recognize this error? First, realize that the backtracking principle is a lot more than just a mathematical curiosity. It can produce very palpable bugs, like this one.

    Second, if you think that you have a bug like this, run your script with the '-Dr' switch on the command line, something like:

    prompt% perl5 -Dr script.p

    This will run your script, but show your compiled regular expressions on the screen. Note that you need to compile Perl with the '-DDEBUGGING' flag, or get a Perl binary that was built with it, for this switch to work. We shall talk about this more in 'Debugging Perl'.
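
    If your Perl is recent enough to ship the 're' pragma, there is another route that, as far as I know, does not need a '-DDEBUGGING' build: 'use re "debug"' prints the compiled form of each regular expression, and a trace of each match attempt, on standard error. A minimal sketch, assuming such a Perl; the exact trace you see will vary from version to version:

```perl
use strict;
use warnings;

my $string = '"bbb';
my $matched;
{
    # Lexically scoped: only regular expressions compiled in this block are traced.
    use re 'debug';
    $matched = ($string =~ m/"([^\\"]*)*"/) ? 1 : 0;
}

# The trace goes to STDERR; the result of the match itself is unchanged.
print "matched: $matched\n";
```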

    Finally, realize that you are going to run into a couple of these. You can usually get round them by finding a way to write the same expression without one of the stars. In this case:

    m/"((?:[^"\\]|\\.)*)"/;

    will do the trick, since we don't need the internal star here. We also make it more efficient by putting the most common alternative first ([^"\\]) and by using ?: so that no backreferences are saved except for the contents of the string. This little regular expression that we end up with is a gem, and can be used, for example, to rewrite the 'recognizing HTTP and FTP addresses' example. Since FTP and HTTP addresses are double quoted strings inside html files, this code becomes:

    use FileHandle;
    undef $/;                                # read the whole file in one go
    my $fh = new FileHandle("$ARGV[0]");
    my $line = <$fh>;
    while ($line =~ m/HREF="((?:[^"\\]|\\.)*)"/g)
    {
        push(@addresses, $1);
    }

    which will then go through the file named by the first argument and pick out the addresses in it.
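
    Here is the same loop as a self-contained sketch, with a made-up scrap of HTML in place of the file. The third link contains escaped quotes, to show that the pattern carries on past them to the real closing quote.

```perl
use strict;
use warnings;

# Hypothetical HTML, standing in for the contents of a bookmarks file.
my $line = <<'END';
<A HREF="http://members.aol.com/index.html">Home</A>
<A HREF="ftp://ftp.example.com/pub/">Files</A>
<A HREF="http://www.example.com/say \"hi\"">Odd</A>
END

my @addresses;
# /g makes each pass through the loop pick up where the last match ended.
while ($line =~ m/HREF="((?:[^"\\]|\\.)*)"/g)
{
    push(@addresses, $1);
}

print "$_\n" for @addresses;
```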

    Putting It All Together.

    This chapter has been a long one primarily because 1) regular expressions are powerful, 2) regular expressions are heavily integrated into the rest of the language, and 3) people get confused about regular expressions and seldom use them correctly. If you learn their secrets, they make your life a lot easier, especially if you deal with large amounts of data.

    Regular expressions form almost a 'language within a language' in Perl. As you can see above, they can be fairly involved, and (let's face it) if you are not familiar with them now, you are not going to learn them without practice. Therefore, we suggest the following path for learning regular expressions.

    Learn well the principles in this chapter. In order to use regular expressions effectively, you are going to need to know them, or you will be spinning your wheels quite a lot. Then try several regular expressions in actual use, working from simple to complicated.

    Also, look at other examples in the book. I haven't included nearly as many as I would have liked in this chapter, so I stuffed a few in Chapter 14, just for good measure! Look at the sections about regular expression debugging, in the chapter on 'Debugging Perl'.

    Be sure to check out the documentation in perlre, and check out Jeffrey Friedl's book, 'Mastering Regular Expressions'. Whereas this chapter will get you started and give you some skill in regular expression manipulation, it cannot possibly give the same depth of treatment that his book does. As far as I know, it's the only book of its kind out there, fully devoted to the practical use of regular expressions.

    Finally, let your knowledge about regular expressions evolve. Start with simple regular expressions, and as your knowledge grows, let your regular expressions grow.



    HTML conversions by Mega Space.

    This page updated on October 14, 1997 by Webmaster.

    Computing McGraw-Hill is an imprint of the McGraw-Hill Professional Book Group.
