Binary files allow us to explicitly encode the type of a value, be it
an integer or real number, a string, etc. However, so much of the
data we deal with are given to us as text. We input numbers in text
files, we download text files from Web and FTP servers, we save
spreadsheets as comma-separated values in .csv files.
Numbers are
often organized in tables and transferred in regular text files.
Genomic sequence data are long strings from the 4 letter alphabet {A,
C, G, T}. In these cases, the data are merely represented by their
text form and are easily interpreted by applications. However, there
are many examples of more complex situations where the data are not as
easily interpreted as numbers. Instead, the text must be processed to
create the values of interest. The simplest example of such text is
when the values are embedded into the text, but not in a regular or
simple format. Instead, we must extract the different elements from
the content by identifying the patterns where the values occur. In
other cases, the text is the data and we must search for the presence
of certain words or phrases in particular contexts or places. We
might be interested in how often each word is used, the names of the
author(s), references to other documents, etc.
Nowadays, documents and text are often treated directly as
data, for example in search engines and databases.
Web logs are a good example, as are mail messages, mail logs,
and network packet data (e.g. tcpdump output).
In order to be able to manipulate textual data, we
need tools to express some common tasks and operations
on the data. We need to be able
to ask questions and perform operations such as
- does this text contain a particular word or sequence of characters?
- does this line start with the text "From:"?
- does this text contain a sub-string of the form three A's followed by either a G or two T's?
- does the text contain any word repeated twice in succession, e.g. a a, the the, cat cat?
- substitute any occurrence of the string "bob" with "Bob".
- discard any comments in the text, identified by # to the end of the line (a preview of these last two tasks in R appears just after this list).
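As a preview, the last two of these tasks might look like the following in R, using the gsub function that is introduced later in this section. The test strings and the comment pattern "#.*$" (one way of saying "from # to the end of the line") are made up for illustration.
> gsub("bob", "Bob", "bob wrote to bob")
[1] "Bob wrote to Bob"
> gsub("#.*$", "", "x = 1  # set the counter")
[1] "x = 1  "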
Using any general programming language such as
Matlab, Java, C, Perl, etc.,
we could develop functions to perform
these different sample tasks for operating
on strings and patterns within them.
Most people would approach these problems in a
reasonable way by breaking the input text
into a collection of characters and
implementing the different operations
by iterating over these characters
looking for the particular pattern.
For example, we might write an R function
that would determine if the input text
string
contained the word given by the
argument
pattern.
findWord = function(string, pattern)
{
    # Split the string into its individual characters.
    letters = substring(string, 1:nchar(string), 1:nchar(string))
    # Find the positions at which the first character of the pattern occurs.
    el = substring(pattern, 1, 1)
    possibles = which(letters == el)
    # Check whether any substring of the same length as the pattern,
    # starting at one of those positions, equals the pattern exactly.
    any(pattern == substring(string, possibles, possibles + nchar(pattern) - 1))
}
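A quick check of this function on a couple of strings (the expected output follows from the definition above):
> findWord("and one and one is two", "one")
[1] TRUE
> findWord("a test", "one")
[1] FALSE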
This matches literal strings given by
pattern
within the given
string by searching for all
the occurrences of the first character in
pattern and then looking at all substrings
with the same length as
pattern starting from
those points. This works fine, but is quite limited. It cannot
handle anything but literal patterns to match. In many circumstances,
we want to specify more complicated patterns than simple literal
strings. Instead, we want to ask about alternative patterns, i.e. of
the form this or that.
For example, we might want to ask does a
string contain the pattern duncan@wald.ucdavis.edu or
duncantl@ucdavis.edu, two e-mail accounts at UC Davis. We can
perform two searches, one on each literal string and combine the
results. We could even do this in our function by allowing
pattern to be a vector rather than a simple
string. Would this mean that a match occurs if any of the values in
pattern are found, or only if all of them
are found? The burden is on the caller to specify the two
literal strings. If these patterns are long but very similar, this
gets tedious and error prone. For example, suppose we have two e-mail
accounts duncantl@wald.ucdavis.edu and duncan@wald.ucdavis.edu. Then
we would like to look for both but specify this more conveniently
than specifying them both in explicit form.
Looking at the two strings, we would like to be able to express the
concept of "find a match for duncan, optionally followed by 'tl', and
then followed by @wald.ucdavis.edu". So we need a way of specifying
an optional piece of the pattern.
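As a preview of the regular expression language described below, the ? quantifier makes the preceding sub-pattern optional, so a pattern along the lines of "duncan(tl)?@wald\\.ucdavis\\.edu" expresses exactly this idea (the doubled backslashes are needed when typing the pattern as an R string so that the . matches a literal period):
> grep("duncan(tl)?@wald\\.ucdavis\\.edu", c("duncan@wald.ucdavis.edu", "duncantl@wald.ucdavis.edu"))
[1] 1 2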
To solve the problem of looking for patterns that start a new line,
for example, we would need the caller to specify this in the pattern
as "\nFrom".
We can write a suite of functions to perform the different types of
matching and substitutions we have outlined. Since they are commonly
needed, it is reasonable to guess that others might have already
implemented these for their purposes and that we might be able to reuse
their code. As with all software, we would like to reuse such code as
it is likely to be better tested and more efficient than our initial
efforts. And it turns out that indeed people have written such tools.
Even with such tools, the users have to remember the different
functions and their arguments to express different types of matching
and substitution. The functions might be part of a programming
language, but within this programming language, we have to handle
invoking the relevant function for our intent. Ideally, we would be
able to use a language to express these patterns in a unified manner.
And indeed, there is such a general language with several
implementations that are well-tested and efficient. Patterns
specified in this language are called regular expressions. And, like
many programming languages, the language of regular expressions
provides basic building blocks from which we can create much more complex patterns.
In the course of the next few sections, we will try to motivate some
of the building blocks of the regular expression language and
illustrate their use. Since it is essentially a programming language,
we can develop complicated constructs by combining the different
elements in different ways. The goal here is to teach you about these
basic elements of the language, how to remember what they do and how
they can be combined to express patterns we want. There are many
tutorials and examples on the Web that cover different uses and
applications of regular expressions. There are also books on the
subject that you can read to get more examples and a greater
understanding of how regular expressions work. You are strongly
encouraged to read these documents to see different examples. And
most of all, you need to practice creating regular expressions and
testing them on data in R or another programming language
to get both an understanding and experience so that when you
need to use them, they will be familiar.
R Functions for Regular Expressions
There are several building blocks in the regular expression language
for specifying patterns that are to be matched in a piece of text.
These allow us to match literal strings, a character from a particular
set of characters or its complement (character sets), one sub-pattern
or another (alternation). We can also match by position such as at
the beginning or end of a line, and we can create sub-patterns from
individual patterns by specifying the number of times the pattern
should be matched (quantifiers).
We will look first at simply matching patterns in
text/strings. There are two functions in R that allow us to do this.
These are
grep and
regexpr. Each of these takes the regular
expression specifying the overall pattern to match and then a
character vector containing the different text strings.
grep returns the indices of the elements of
that character vector for which there was a match (or the empty
integer vector if none matched). This can be readily used to subset
the character vector to get only the elements containing or not
containing that pattern. It is used as a means to find the strings
that matched. For example, to find the elements in the character
vector
c("a test", "a basic string", "and one that we
want", "one two three") that contain the literal string
"one", we can use
grep:
> grep("one", c("a test", "a basic string", "and one that we want", "one two three"))
[1] 3 4
This indicates that the 3rd and 4th elements contained a match.
The first two elements did not.
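Since grep returns indices, we can use the result directly to subset the original vector and keep just the elements that matched (storing the vector in a variable named strings purely for readability):
> strings = c("a test", "a basic string", "and one that we want", "one two three")
> strings[grep("one", strings)]
[1] "and one that we want" "one two three"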
The function
regexpr returns more detailed
information, telling us a) which elements of the character vector
actually contained a match for the regular expression, and b)
the position of the substring that was matched by the
regular expression pattern.
In our simple example above looking for the string "one",
we can use
regexpr.
> regexpr("one", c("a test", "a basic string", "and one that we want", "one two three"))
[1] -1 -1 5 1
attr(,"match.length")
[1] -1 -1 3 3
The return value is an integer vector with an element
for each of the elements in the strings vector.
Each element gives the position of the starting character of the match,
if it exists, and -1 when no match occurs for that string.
To identify the end of the substring that matches the pattern,
we also need to know how many characters were matched in each string.
This is given by a second integer vector
returned by
regexpr.
This is returned in a slightly odd form, namely
as an attribute attached to the first integer
vector of starting positions.
This is a common idiom in S that lets us treat
the return value directly as a simple integer vector
(in this case) while still carrying additional
information around with it.
To get the substring in each string that corresponds to the
match, we can store this return value and use it along with
substring:
> x = regexpr("one", c("a test", "a basic string", "and one that we want", "one two three"))
> substring(c("a test", "a basic string", "and one that we want", "one two three"), x, x + attr(x, "match.length") - 1)
[1] ""    ""    "one" "one"
>
This is a very useful tool for getting an understanding of what a
particular pattern actually matches. We can create a pattern to match,
give it different test strings, and see which parts actually
match. This is a very important exercise to practice in order to really
understand regular expressions.
There are more general ways to get the matching substring via
substitution and these allow us to match multiple sub-patterns within
our overall match and extract the different pieces directly. And that
brings us to the other functions in R for dealing with regular
expressions.
gsub and
sub are the two functions that allow us to
replace matches of a pattern within a string with some other
text (all of the matches, or just the first, respectively).
Each of these functions takes the regular expression defining
what to match, the text to use as the replacement,
and then the text/strings on which to do the matching and
substitution.
We can take our example above with a slightly different
set of strings to illustrate what
gsub
does and how it differs from
sub.
We will replace the word (or literal string, actually) "one"
with the digit 1.
> gsub("one", "1", c("a test", "and one and one is two", "one two three"))
[1] "a test" "and 1 and 1 is two" "1 two three"
>
There was no "one" in the first string ("a test"), so there was no way
to substitute the match with the replacement text ("1"). So it
remains unaltered and is returned "as is". In the second string,
there are two occurrences of the string "one". Each of these is
replaced with the digit "1". And similarly, the third string has its
occurrence of "one" replaced with "1".
The 'g' in the name
gsub
refers to
global.
This means that we should change all the matches of the
regular expression in the text with the replacement pattern.
The
sub function
is almost exactly the same as
gsub
except that it only replaces the first occurrence
of the pattern with the replacement text.
> sub("one", "1", c("a test", "and one and one is two", "one two three"))
[1] "a test" "and 1 and one is two" "1 two three"
The most basic building block in a language that supports matching
patterns in text is a facility for specifying a little string of text
as a pattern to match. We have already seen this: we look for the
literal string "one" in the the text "and one and one is two". When
we write down a regular expression pattern like "one", what we mean is
really the following: find the character 'o' immediately (i.e. the
next character) followed by an 'n' and that followed immediately by
'e'. What the regular expression matching engine does is, for each of
the target strings, start at the first character and check to see if
it is an 'o'. If not, it then moves to the second character and starts
looking there. When it finds an 'o', it then checks if the next
character is an 'n'. If not, it starts looking for the 'o' at the next
starting position. So we can think of the literal string as being
made up of three consecutive sub-patterns: 'o', 'n' and 'e'. Thinking
of the pattern "one" in this way makes it easier to see how we can
combine different types of complex patterns (see below) to define a
sequence.
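We can see this behaviour with a made-up string in which the first two 'o' characters are not followed by "ne"; the engine keeps restarting until it reaches the 'o' that is:
> regexpr("one", "ooone")
[1] 3
attr(,"match.length")
[1] 3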
Suppose we want to match a white space character, i.e. a space, a
TAB, etc. We can use alternation that we described above as in the
pattern " |\t". And if we wanted to match any digit, i.e. 0, 1, 2, 3,
4, 5, 6, 7, 8 or 9, we could use the construction:
"0|1|2|3|4|5|6|7|8|9". This is tedious to write and becomes difficult
to read as the number of characters to be matched becomes lengthy. We
are not expressing a concept of "match a digit", but instead
explicitly enumerating the characters. This makes maintaining and
understanding the regular expression more difficult.
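The enumerated form does work; it is simply verbose and hard to read. For example, with a couple of made-up strings:
> grep("0|1|2|3|4|5|6|7|8|9", c("no digits here", "version 2"))
[1] 2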
Additionally, these single character alternations
are not very efficient. They can slow down
the speed with which the regular expression automaton
performs the matching.
Instead of using alternations for single characters,
the regular expression language provides
a way to specify a match as being any character in a specified
set. These patterns are called character classes.
We use the [ ] notation
to identify a character class and
specify the collection of characters that constitute a match.
For example, to match a space or a TAB character, we use the pattern
"[ \t]".
To match a digit or A, B, C, D, E or F,
we can use "[0123456789ABCDEF]".
This gives us the "digits" in hexadecimal numbers.
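For example, with some made-up test strings (recall that "\t" is how we type a TAB character in an R string):
> grep("[ \t]", c("nowhitespace", "has a space", "tab\there"))
[1] 2 3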
Basically, we can enumerate any collection of characters within the [
] and these are included in the set that constitutes a match. We can
also adapt this notation very slightly to indicate a match on the
complement of the set of characters. There are many collections of
characters that are commonly used. For example, we often want to
specify all the letters of the alphabet, lower or upper case or
both. And we often want all the digits. And in other cases, like the
hexadecimal digits, we want a subset of these sets.
The - character, when used within the character class pattern (i.e. within the
[ ]), typically identifies a range.
We can specify the digits 3, 4, 5, 6 more readily as "[3-6]",
and this is the same as "[3456]".
Similarly, we can specify "[0-9a-fA-F]"
to match all the hexadecimal digits, both lower and upper case.
And you will often see "[a-zA-Z]"
for all letters in the alphabet.
Once again, we are expressing higher level concepts
that are easier to read than explicitly enumerating
the elements of the set.
Consider "[0-9]"
and "[0123456789]".
To compare these, we have to look at the second and verify that
the digits are contiguous and that none of the earlier
values has been omitted.
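A quick check of the range notation on some made-up strings; only the first contains a digit between 3 and 6:
> grep("[3-6]", c("2468", "017", "9 lives"))
[1] 1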
If we want to include the character - in our set
of characters to match, then we must put it
at the beginning (or the very end) of the character set so that it is
not mistaken for a range.
For example, to match the basic arithmetic operators +, -, * or /,
we can use "[-+*/]".
Or to match a digit with either a + or - in front of it,
we can use "[-+][0-9]".
This example illustrates how we make up
the overall pattern with a
sequence
of sub-patterns and each of these
sub-patterns is made up using the primitive elements.
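We can test this combined pattern on a few made-up strings; regexpr reports where the sign-and-digit substring begins in each:
> regexpr("[-+][0-9]", c("temperature -3 degrees", "+45", "no digits here"))
[1] 13  1 -1
attr(,"match.length")
[1]  2  2 -1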
Character classes are very convenient, and the range operator (-) to
succinctly specify collections of characters further simplifies their
use. The regular expression language provides a collection of
built-in character sets for commonly used collections. Each of these
is identified by a short name given in the list below
and includes the collection described there.
- All alphabetic and numeric (alnum)
- All alphabetic (alpha)
- Blank characters, i.e. space or tab (blank)
- Control characters (cntrl)
- Digits (digit)
- Printable characters except space (graph)
- Lower case alphabetic characters (lower)
- Upper case alphabetic characters (upper)
- Printable characters (print)
- Punctuation characters (punct)
- White space (space)
- Hexadecimal digit (xdigit)
We use these collections using the same [ ] notation, but we specify
the named character set with an additional [: :] pair. So, for
example, we can specify the punctuation characters as "[[:punct:]]".
The '[:punct:]' term is the named character class. We can include
additional characters in the overall set, such as
"[[:punct:]a-h]" to match punctuation characters
and
the characters a, b, c, d, e, f, g, h.
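For example, we might use gsub to strip all punctuation characters from a made-up piece of text:
> gsub("[[:punct:]]", "", "Hello, world! (a test)")
[1] "Hello world a test"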
The character set mechanism is
convenient for specifying a set of characters that we want to match.
In some circumstances, we want to specify a set of characters that we
don't want to match. Or more specifically, we want to match an
element in the complement of a particular set, i.e. anything but an
element amongst these. Once again, the regular expression language
provides us with a mechanism to do this. By putting the ^ (caret)
character at the beginning of the character set, we specify the
complement of the set of specified characters.
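For example, the pattern "[^0-9]" matches any character that is not a digit, so we can use it with gsub to keep only the digits in a made-up string:
> gsub("[^0-9]", "", "phone: 530-555-1234")
[1] "5305551234"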
If we want to include ^ in a regular character set and not mean the
complement, we need only list it in the set in any position
other than the first.
> regexpr("[a-z^]", "0^2")
[1] 2
attr(,"match.length")
[1] 1
References
Jeffrey E. F. Friedl, Mastering Regular Expressions, Second Edition. O'Reilly.
David Mertz, Text Processing in Python. http://gnosis.cx/publish/programming/regular_expressions.html