Next: Functions Up: Regular Expressions Previous: Matching Multiple Occurences of   Contents

Substitutions

Now let's focus on the other regular expression operator, the s operator. Using this operator, we can search for regular expressions and replace all or part of the text they match with text of our choosing. In the simplest case, we may want to replace all of the literal occurrences of one string with another string. Suppose we wish to replace all occurrences of the word "dog" in a piece of text with the word "cat". A common error when using the substitution operator is to forget the g option in this case. Remember that, without this option, only the first occurrence of the regular expression will be changed. If the text within which we wanted to perform the changes was stored in $line, we could use the following command:
     $line =~ s/dog/cat/g;
Note that this will change every occurrence of dog to cat, even occurrences embedded inside other words (like endogenous), which may not be what you want. The word boundary symbol, \b, should be used whenever you specifically want to change a word, not just a character string:
     $line =~ s/\bdog\b/cat/g;
The previous command will change dog to cat only where dog appears alone as a word. To make the substitution case-insensitive, simply add an i either before or after the trailing g of the substitute command. The s operator returns the number of substitutions which took place (a false value if there were none). This is handy in its own right, but also allows you to use the operator as the condition of a while or if statement whose body will be executed only if any substitutions took place.
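
As a small illustration of the return value together with the i and g options, the following sketch counts the substitutions made in a sample string. The sample text and the variable $count are invented for this example.

```perl
use strict;
use warnings;

# Sample text (made up for this example); "endogenous" contains "dog"
my $line = "The dog chased another Dog; endogenous factors were ignored.";

# \b restricts the match to whole words, i ignores case, g replaces every match.
# In scalar context the s operator returns the number of substitutions made.
my $count = ($line =~ s/\bdog\b/cat/gi);

# The body runs only if at least one substitution took place
if ($count) {
    print "Made $count substitution(s): $line\n";
}
```

Here both dog and Dog are replaced, while endogenous is left alone because of the word boundaries.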

Sometimes you wish to make substitutions in a variable while retaining a copy of the original, unsubstituted string. For example, you may have a piece of text containing angle brackets. To print this text to the screen, you'd want to leave the angle brackets in place, but to insert the string in an HTML file, you'd need to substitute &lt; for the left angle bracket (<) and &gt; for the right angle bracket (>). A common Perl construct for this sort of task is to put a parenthesized assignment on the left hand side of the =~ operator; this copies the string to a new variable and then modifies the copy, while leaving the original intact. Suppose that our text containing angle brackets is stored in a string called $orig, and we wish to create a string called $new which holds the modified version. We could use code like this:

     ($new = $orig) =~ s/</&lt;/g;
     $new =~ s/>/&gt;/g;
Naturally, the assignment of $orig to $new could have been done in a separate statement, but most Perl programmers use the above construct to copy and modify a character string in a single statement.
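
The following is a minimal, self-contained sketch of the copy-and-modify construct, showing that the original string is left untouched. The sample text is invented, and the extra substitution for the ampersand is an addition not discussed above: it is done first so that the ampersands in the inserted &lt; and &gt; are not escaped a second time.

```perl
use strict;
use warnings;

my $orig = "if (a < b && b > c)";   # sample text, invented for this example

# Copy $orig into $new, then modify only the copy.
# Ampersands are handled first (an addition to the example in the text).
(my $new = $orig) =~ s/&/&amp;/g;
$new =~ s/</&lt;/g;
$new =~ s/>/&gt;/g;

print "original: $orig\n";   # still contains the raw angle brackets
print "escaped:  $new\n";    # suitable for inserting into an HTML file
```

With use strict in effect, the copy is declared inside the parentheses with my, which is why the construct appears here as (my $new = $orig).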

The tagging mechanism described in Section [*] can be used to rearrange pieces of a string in the substitution. Suppose we have lines of text, each with a word followed by a number, and we wish to print out the lines with the number preceding the word. We can tag each piece of the line and then refer to the tagged pieces in the usual way ($1, $2, etc.) on the right hand side of the substitution:

     s/(\w+)\s+(\d+)/$2 $1/;
The use of \s+ as a separator between the two patterns allows for any amount of whitespace between the word and the number.
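
To see the rearrangement at work, here is a short sketch that applies the substitution to a few sample lines separated by differing amounts of whitespace. The data and the names @lines and @swapped are made up for this illustration.

```perl
use strict;
use warnings;

# Sample data: a word, some whitespace (spaces or a tab), then a number
my @lines = ("apple 12", "pear    7", "fig\t300");

my @swapped;
for my $line (@lines) {
    # $1 holds the word and $2 the number; the replacement swaps them
    (my $copy = $line) =~ s/(\w+)\s+(\d+)/$2 $1/;
    push @swapped, $copy;
    print "$copy\n";
}
```

Whatever whitespace separated the two pieces is replaced by the single space written in the replacement text.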

One of the modifiers in Table [*] requires some additional explanation. When you use the e modifier in a substitution operation, the replacement text is not used literally; it is treated as a piece of Perl code and evaluated, and the result of the evaluation is used as the replacement. Suppose we have a file which contains line numbers at the beginning of each line, and we wish to print the lines, but with each line number incremented by five. Regular expressions alone offer no direct solution to this problem, since they cannot do arithmetic. But if we replace each number with the result of evaluating an expression like $x + 5, the problem becomes very simple. Here's an implementation of that idea using the e modifier:

     while(<>){
         s/^(\d+)/$1 + 5/e;
         print;
     }
The number at the beginning of each line will be replaced by the result of evaluating a Perl expression which adds 5 to that number. Due to the greediness of regular expressions (Section [*]), there's no need to use a word boundary after the tagged expression. Another use of the e modifier is to reformat numbers using the sprintf function (Section [*]). Suppose we want to ensure that all the numbers in a report are printed in a field width of seven, with two decimal places. The following substitution command will do the job:
     s/\b([\d.]+)\b/sprintf("%7.2f",$1)/eg;
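
The two uses of the e modifier can be tried out on in-memory data with a sketch like the following. The sample strings and variable names are invented for this example, and the input is assumed to contain only well-formed numbers, since the character class [\d.]+ would also accept strings such as 1.2.3.

```perl
use strict;
use warnings;

# 1. Increment a leading line number by five
my @numbered = ("10 first line", "20 second line");   # sample lines
my @renumbered;
for my $text (@numbered) {
    # With e, the replacement "$1 + 5" is evaluated as Perl code
    (my $copy = $text) =~ s/^(\d+)/$1 + 5/e;
    push @renumbered, $copy;
}
print "$_\n" for @renumbered;

# 2. Reformat every number to width seven with two decimal places
my $report = "widgets 3.14159 gadgets 42 total 1234.5";   # sample report line
# \b is zero-width, so only the number itself needs to be captured
(my $formatted = $report) =~ s/\b([\d.]+)\b/sprintf("%7.2f", $1)/eg;
print "$formatted\n";
```

Each number in the report line is padded on the left to seven characters, so columns of such numbers line up when printed.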


Phil Spector 2002-10-18