7.2   Regular Expressions


This section under major construction.

Manipulating text (email, web pages, word processing documents, DNA sequences, Java programs) is one of the most basic and important computing tasks.

Pattern matching. One of the most important text processing problem is pattern matching. The simplest version of the problem is to search for one occurrence (or all occurrences) of some specific string in a large document. For example, we might want to search for the word "needle" in a large word processing document, or we might want to search for all occurrences of the string "acgagggtttaccgaagaat" in the human genome. The BLAST database from the NCBI is a quintessential searching tool for computational molecular biologists. Sometimes we are interested in searching for more complex patterns. When validating a web form, we might like to check whether or not the email address is syntactically valid, e.g., wayne@cs.princeton.edu looks like an email address, but gf!@#$bB does not. A simple test for a valid email address might be: a sequence of one or more lowercase letters, followed by the @ symbol, followed by another sequence of lowercase letters followed by a . followed by another sequence of lowercase letters followed by edu or net. In this section, we will explore effective ways for specifying such patterns and solving the corresponding pattern matching problem.

Other applications. Scan for virus signatures, process natural language, search for information using Google, access information in digital libraries, access information in databases (e.g., Oracle Database 10g, MySQL 3.23) retrieve information from Lexis/Nexis, search-and-replace in a word processors, filter text (spam, NetNanny, Carnivore, malware), validate data-entry fields (dates, email, URL, credit card). Another class of important applications is parsing data files in the natural and social sciences. Some computational physics experiments generate terrabytes of output. It is hopeless to search for useful information without automated methods. Read in data stored in TOY input file format, compile a Java program, crawl and index the Web, automatically create Java documentation from Javadoc comments, ...

Regular expressions. A regular expression is notation for specifying a set of strings, e.g., the set of all valid email addresses or the set of all binary strings with an even number of 1s. Since the set might contain infinitely many members, we can't simply enumerate them. There are five basic operations for creating regular expressions, and the table below illustrates them by example.

Operation Regular Expression Yes No
Concatenation aabaab aabaab every other string
Logical OR
(Alternation)
aa | baab aa
baab
every other string
Replication
(Kleene closure)
ab*a aa
aba
abba
ε
ab
ababa
Grouping a(a|b)aab aaaab
abaab
every other string
Wildcard a..a abba
abaa
aa
aaaaa

Building blocks of Java vs. regular expressions. These building blocks parallel the way Java programs are written. Straight-line programs are sequence of instructions executed one after the other. The concatenation operation is formed by combining a sequence of regular expressions, one after the other. Branch statements (if-else) enable us to execute one of several possible statements. The logical OR operation matches one of two possible expressions. Combining several branch statements or OR operations enable us to deal with exponentially many possibilities. While loops take us into the infinite and allow us to repeat things any number of times. The replication operator plays a similar role. Finally, in Java we group statements together in blocks by surrounding them with curly braces. With regular expressions, we use parentheses. There is one crucial difference between Java programs and regular expressions. In Java, we can declare variables and use them to store intermediate values in a computation, whereas with regular expressions there is no such mechanism to remember state.

Regular expression examples. The operators described are the building blocks of regular expressions. By combining them in clever ways, we can express very complicated patterns. One particularly useful regular expression is (a|b)*. This means take either the symbol a or the symbol b, zero or more times. This has the effect of matching all possible strings of a's and b's. If we want to match all strings that contain aacgaagggact as a substring, we can use the regular expression (a|c|g|t)*aacgaagggact(a|c|g|t)*. The following table illustrates some additional combinations.


Regular Expression Yes No
a* | (a*ba*ba*ba*)*
multiple of 3 b's
ε
bbb
aaa
abbbaaa
bbbaababbaa
b
bb
abbaaaa
baabbbaa
a | a(a|b)*a
begins and ends with a
a
aba
aa
abbaabba
ε
ab
ba
(a|b)*abba(a|b)*
contains the substring abba
abba
bbabbabb
abbaabba
ε
abb
bbaaba


FMR-1 triplet repeat region. "The human FMR-1 gene sequence contains a triplet repeat region in which the sequence CGG or AGG is repeated a number of times. The number of triplets is highly variable between individuals, and increased copy number is associated with fragile X syndrome, a genetic disease that causes mental retardation and other symptoms in one out of 2000 children." (Reference: Biological Sequence Analysis by Durbin et al). The pattern is bracket by GCG and CTG, so we get the regular expression GCG (CGG | AGG)* CTG.

Using regular expressions. Regular expressions are a standard programmer's tool. They are built into most modern languages including Java, Perl, and Python. The Unix program grep and the Windows program findstr are indispensable command line tools when searching for data (in programs, email, or wherever). History of grep and other Unix tools. Developed by Ken Thompson in 1973 to aid McIlroy in reading text aloud through a voice synthesizer. Program Validate.java demonstrates how to use regular expressions in Java using the string library method matches. The user enters the text string and the pattern on the command line, and the program prints out true or false, depending on whether or not the text matches the pattern.

public class Validate { 
    public static void main(String[] args) { 
        String pattern = args[0];
        String text    = args[1];
        System.out.println(text.matches(pattern));
    }
}

As a simple example, we might check to see if the text has a multiple of 3 b's.

% java Validate "a* | (a*ba*ba*ba*)*" abbbaaa
true

% java Validate "a* | (a*ba*ba*ba*)*" abbaaa
false

We note that we must use quotation marks to delimit the pattern, since otherwise the operating system would interpret the | and * symbols before passing along the string to the Java program.

The first four basic operations above (concatenation, logical or, replication, grouping) are the theoretical minimum needed to describe regular expressions. Most programming environments support additional operations for convenience (including the wildcard operation), and Java is no exception. The table below includes some of the highlights.


Operation Java Regular Expression Yes No
One or more a(bc)+de abcde
abcbcde
ade
abc
Once or not at all a(bc)?de ade
abcde
abc
abcbcde
Character classes [a-m]* blackmail
imbecile
above
below
Negation of character classes [^aeiou] b
c
a
e
Exactly N times [^aeiou]{6} rhythm
syzygy
rhythms
allowed
Between M and N times [a-z]{4,6} spider
tiger
jellyfish
cow
Whitespace characters [a-z\s]*hello hello
say hello
Othello
2hello


Program Grep.java reads in a regular expression from the command line and processes standard input one line at a time, printing out any lines contain exact matches as substrings.


Here is Oracle's guide to using regular expressions. It includes many more operations that we will not explore. Also see the String methods matches(), split(), and replaceAll(). These are shorthands for using the Pattern and Matcher classes. Here's some common regular expression patterns.

Regular expression applications. Email addresses, Java identifiers, integers, decimal, ....

Our email address example has been greatly simplified for convenience. Here's a library of useful regular expressions that offers industrial grade patterns for email addresses, URLs, numbers, dates, and times. Try this regular expression tool.

Harvester. The method String.matches is useful when we want to search for some pattern in a string. To do repeated searches in Java, we specify the pattern using a regular expression, preprocess it using Pattern.compile and Pattern.matcher, and check for matches using Matcher.find and Matcher.group. Program Harvester illustrates this process. It prints out all of the email addresses on a given web page. Create a regular expression that matches email addresses that start with one or more letters, followed by a @, followed by a sequence of letters, then a ., then letters, then a ., and so forth, and finally ending with com, edu, or gov. Note: to match a ., we use //. in the Java regular expression to escape the .. The variable input is a long string.

Pattern pattern = Pattern.compile(regexp);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
   System.out.println(matcher.group());
}

Ad blocking. Adblock uses regular expressions to block banner adds under the Mozilla and Firebird browsers.

Parsing text files. A more advanced example where we want to extract specific pieces of the matching input. This program typifies the process of parsing scientific input data. The TOY program input file consists of a sequence of lines. If a line begins with two hex digits xy, followed by a colon, followed by a space, followed by four hex digits abcd, then we load instruction abcd into memory location xy. Otherwise, we ignore the line. We use a Java regular expression to parse each line.

String filename = args[0];
In in = new In(filename);
String regexp = "([0-9A-Fa-f]{2}):[ \t]*([0-9A-Fa-f]{4}).*";
Pattern pattern = Pattern.compile(regexp);
String line;
while ((line = in.readLine()) != null) {
   Matcher matcher = pattern.matcher(line);
   if (matcher.find()) {
      int addr = fromHex(matcher.group(1));
      int inst = fromHex(matcher.group(2));
      mem[addr] = inst;
   }
}

The method call matcher.group(1) returns the substring that matches the portion of the regular expression delimited by the first set of parentheses (the two hex digits). The method call matcher.group(2) returns the substring that matches the portion of the regular expression delimited by the second set of parentheses (the 4 hex digits). The colon, whitespace, and any comments are ignored.

PROSITE. PROSITE is the "first and most famous" database of protein families and domains. Its main use it to determine the function of uncharacterized proteins translated from genomic sequences. Biologists use PROSITE pattern syntax rules to search for patterns in biological data. Here is the raw data for CBD FUNGAL (accession code PS00562). Each line contains various information. Perhaps the most interesting line is the one that begins with PA - it contains the pattern that describes the protein motif. Such patterns are useful because they often correspond to functional or structural features.

PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C.

Each uppercase letter corresponds to one amino acid residue. The alphabet consists of uppercase letters corresponding to the 2x amino acids. The - character means concatenation. For example, the pattern above begins with CGG (Cys-Gly-Gly). The notation x plays the role of a wildcard - it matches any amino acid. This corresponds to . in our notation. Parentheses are used to specify repeats: x(2) means exactly two amino acids, and x(4,7) means between 4 and 7 amino acids. This corresponds to .{2} and .{4,7} in Java notation. Curly braces are used to specify forbidden residues: {CG} means any residue other than C or G. The asterisk has its usual meaning.

Could also mention the String library method split() that we used in the Kevin Bacon program to parse /-delimited text files. The argument can be a general regular expression, instead of just "/" character.

Wildcards. Specialization of regular expressions. Use symbol * to mean .*, i.e., match any number of characters. No parenthesis or logical OR operation. Common with operating systems, e.g., to list all files ending in .txt.

ls *.txt       // Unix, Linux, OS X
dir *.txt      // Windows

Also built in to search engines like Google. Find all web pages with the phrase premature * is the root of all evil.

Web crawler. Google indexes the web by starting with a web page and exploring all web pages that are directly linked from the first page. Program LinkFinder.java takes the name of a web page as a command line input, and prints out all web pages listed in the original web page. Use the following approximate rule: starts with http:// and is followed by any number of non-whitespace characters except ". This program could be used to check a webpage for broken links.

Program WebCrawler.java expands on LinkFinder.java by recursively finding the links on all of the pages that the original page points to, and so on. It maintains a queue of web pages that still need to be explored. As a result, the web pages are examined in breadth first order, starting from the initial seed page. To avoid visiting a web page more than once, we maintain a symbol table of previously visited pages. It is somewhat restricted since it only follows absolute web links of a special form, but it's a proof-of-concept program. Also, you get error messages for all of the URLs that are malformed on unreachable.

Q + A

Q. I'm confused why does (a | b)* match all strings of a's and b's, instead of only string with all a's or string with all b's?

A. The * operator replicates the regular expression (and not a fixed string that matches the regular expression). So the above is equivalent to ε | (a|b) | (a|b)(a|b) | (a|b)(a|b)(a|b) | ....

Q. History?

A. In the 1940s, Warren McCulloch and Walter Pitts modeled neurons as finite automata to describe the nervous system. In 1956, Steve Kleene invented a mathematical abstraction called regular sets to describe these models. Representation of events in nerve nets and finite automata. in Automata Studies, 3-42, Princeton University Press, Princeton, New Jersey 1956.

Exercises

  1. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
    1. 0 or 11 or 101
    2. only 0s

    Answers: 0 | 11 | 101, 0*

  2. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
    1. all binary strings
    2. all binary strings except empty string
    3. begins with 1, ends with 1
    4. ends with 00
    5. contains at least three 1s

    Answers: (0|1)*, (0|1)(0|1)*, 1 | 1(0|1)*1, (0|1)*00, (0|1)*1(0|1)*1(0|1)*1(0|1)* or 0*10*10*1(0|1)*.

  3. Write a regular expression to describe inputs over the alphabet {a, b, c} that are in sorted order. Answer: a*b*c*.
  4. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
    1. contains at least three consecutive 1s
    2. contains the substring 110
    3. contains the substring 1101100
    4. doesn't contain the substring 110

    Answers: (0|1)*111(0|1)*, (0|1)*110(0|1)*, (0|1)*1101100(0|1)*, (0|10)*1*. The last one is by far the trickiest.

  5. Write a regular expression for binary strings with at least two 0s but not consecutive 0s.
  6. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
    1. has at least 3 characters, and the third character is 0
    2. number of 0s is a multiple of 3
    3. starts and ends with the same character
    4. odd length
    5. starts with 0 and has odd length, or starts with 1 and has even length
    6. length is at least 1 and at most 3

    Answers: (0|1)(0|1)0(0|1)*,   1* | (1*01*01*01*)*,   1(0|1)*1 | 0(0|1)*0 | 0 | 1,   (0|1)((0|1)(0|1))*,   0((0|1)(0|1))* | 1(0|1)((0|1)(0|1))*,   (0|1) | (0|1)(0|1) | (0|1)(0|1)(0|1).

  7. For each of the following, indicate how many bit strings of length exactly 1000 are matched by the regular expression: 0(0 | 1)*1, 0*101*, (1 | 01)*.
  8. Write a regular expression that matches all strings over the alphabet {a, b, c} that contain:
    1. starts and ends with a
    2. at most one a
    3. at least two a's
    4. an even number of a's
    5. number of a's plus number of b's is even
  9. Find long words whose letters are in alphabetical order, e.g., almost and beefily. Answer: use the regular expression '^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$'.
  10. Write a Java regular expression to match phone numbers, with or without area codes. The area codes should be of the form (609) 555-1234 or 555-1234.
  11. Find all English words that end with nym.
  12. Final all English words that contain the trigraph bze. Answer: subzero.
  13. Find all English words that start with g, contain the trigraph pev and end with e. Answer: grapevine.
  14. Find all English words that contain the trigraph spb and have at least two r's.
  15. Find the longest English word that can be written with the top row of a standard keyboard. Answer: proprietorier.
  16. Find all words that contain the four letters a, s, d, and f, not necessarily in that order. Solution: cat words.txt | grep a | grep s | grep d | grep f.
  17. Given a string of A, C, T, and G, and X, find a string where X matches any single character, e.g., CATGG is contained in ACTGGGXXAXGGTTT.
  18. Write a Java regular expression, for use with Validate.java, that validates Social Security numbers of the form 123-45-6789. Hint: use \d to represent any digit. Answer: [0-9]{3}-[0-9]{2}-[0-9]{4}.
  19. Modify the previous exercise to make the - optional, so that 123456789 is considered a legal input.
  20. Write a Java regular expression to match all strings that contain exactly five vowels and the vowels are in alphabetical order. Answer: [^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*
  21. Write a Java regular expression to match valid Windows XP file names. Such a file name consists of any sequence of characters other than
    / \ : * ? " < > |
    

    Additionally, it cannot begin with a space or period.

  22. Write a Java regular expression to match valid OS X file names. Such a file name consists of any sequence of characters other than a colon. Additionally, it cannot begin with a period.
  23. Given a string s that represents the name of an IP address in dotted quad notation, break it up into its constituent pieces, e.g., 255.125.33.222. Make sure that the four fields are numeric.
  24. Write a Java regular expression to describe all dates of the form Month DD, YYYY where Month consists of any string of upper or lower case letters, the date is 1 or 2 digits, and the year is exactly 4 digits. The comma and spaces are required.
  25. Write a Java regular expression to describe valid IP addresses of the form a.b.c.d where each letter can represent 1, 2, or 3 digits, and the periods are required. Yes: 196.26.155.241.
  26. Write a Java regular expression to match license plates that start with 4 digits and end with two uppercase letters.
  27. Write a regular expression to extract the coding sequence from a DNA string. It starts with the ATG codon and ends with a stop codon (TAA, TAG, or TGA). reference
  28. Write a regular expression to check for the sequence rGATCy: that is, does it start with A or G, then GATC, and then T or C.
  29. Write a regular expression to check whether a sequence contains two or more repeats of the the GATA tetranucleotide.
  30. Modify Validate.java to make the searches case insensitive. Hint: use the (?i) embedded flag.
  31. Write a Java regular expression to match various spellings of Libyan dictator Moammar Gadhafi's last name using the folling template: (i) starts with K, G, Q, (ii) optionally followed by H, (iii) followed by AD, (iv) optionally followed by D, (v) optionally followed by H, (vi) optionally followed by AF, (vii) optionally followed by F, (vii) ends with I.
  32. Write a Java program that reads in an expression like (K|G|Q)[H]AD[D][H]AF[F]I and prints out all matching strings. Here the notation [x] means 0 or 1 copy of the letter x.
  33. Why doesn't s.replaceAll("A", "B"); replace all occurrences of the letter A with B in the string s?

    Answer: Use s = s.replaceAll("A", "B"); instead The method replaceAll returns the resulting string, but does not change s itself. Strings are immutable.

  34. Write a program Clean.java that reads in text from standard input and prints it back out, removing any trailing whitespace on a line and replacing all tabs with 4 spaces.

    Hint: use replaceAll() and the regular expression \s for whitespace.

  35. Write a regular expression to match all of the text between the text a href =" and the next ". Answer: href=\"(.*?)\". The ? makes the .* reluctant instead of greedy. In Java, use Pattern.compile("href=\\\"(.*?)\\\"", Pattern.CASE_INSENSITIVE) to escape the backslash characters.
  36. Use regular expressions to extract all of the text between the tags <title> and <\title>. The (?i) is another way to make the match case insensitive. The $2 refers to the second captured subsequence, i.e., the stuff between the title tags.
    String pattern = "(?i)(<title.*?>)(.+?)(</title>)"; 
    String updated =  s.replaceAll(pattern, "$2");
    
  37. Write a regular expression to match all of the text between <TD ...> and </TD> tags. Answer: <TD[^>]*>([^<]*)</TD>

Creative Exercises

  1. Text to speech synthesis. Original motivation for grep. "For example, how do you cope with the digraph ui, which is pronounced many different ways: fruit, guile, guilty, anguish, intuit, beguine?"
  2. Challenging regular expressions. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
    1. any string except 11 or 111
    2. every odd symbol is a 1
    3. contains at least two 0s and at most one 1
    4. no consecutive 1s
  3. Binary Divisibility. Write a regular expression for each of the following sets of binary strings. Use only the basic operations.
    1. bit string interpreted as binary number is divisible by 3
    2. bit string interpreted as binary number is divisible by 123
  4. Boston accent. Write a program to replace all of the r's with h's to translate a sentence like "Park the car in Harvard yard" into the Bostonian version "Pahk the cah in Hahvahd yahd".
  5. File extension. Write a program that takes the name of a file as a command line argument and prints out its file type extension. The extension is the sequence of characters following the last .. For example the file sun.gif has the extension gif. Hint: use split("\\."); recall that . is a regular expression meta-character, so you need to escape it.
  6. Reverse subdomains. For web log analysis, it is convenient to organize web traffic based on subdomains like wayne.faculty.cs.princeton.edu. Write a program to read in a domain name and print it out in reverse order like edu.princeton.cs.faculty.wayne.
  7. Bank robbery. You just witnessed a bank robbery and got a partial license plate of the getaway vehicle. It started with ZD, had a 3 somewhere in the middle and ended with V. Help the police officer write regular expression for this plate.
  8. Regular expression for permutations. Find the shortest regular expression (using only the basic operations) you can for the set of all permutations on N elements for N = 5 or 10. For example if N = 3, then the language is abc, acb, bac, bca, cab, cba. Answer: difficult. Solution has length exponential in N.
  9. Parsing quoted strings. Read in a text file and print out all quote strings. Use a regular expression like "[^"]*", but need to worry about escaping the quotation marks.
  10. Parsing HTML. A >, optionally followed by whitespace, followed by a, followed by whitespace, followed by href, optionally followed by whitespace, followed by =, optionally followed by whitespace, followed by "http://, followed by characters until ", optionally followed by whitespace, then a <.
    < \s* a \s+ href \s* = \s* \\"http://[^\\"]* \\" \s* >
    
  11. Subsequence. Given a string s, determine whether it is a subsequence of another string t. For example, abc is a subsequence of achfdbaabgabcaabg. Use a regular expression. Now repeat the process without using regular expressions. Answer: (a) a.*b.*c.*, (b) use a greedy algorithm.
  12. Huntington's disease diagnostic. The gene that causes Huntington's disease is located on chromosome 4, and has a variable number of repeats of the CAG trinucleotide repeat. Write a program to determine the number of repeats and print will not develop HD If the number of repeats is less than 26, offspring at risk if the number is 37-35, at risk if the number is between 36 and 39, and will develop HD if the number is greater than or equal to 40. This is how Huntington's disease is identified in genetic testing.
  13. Gene finder. A gene is a substring of a genome that starts with the start codon (ATG), end with a stop codon (TAG, TAA, TAG, or TGA) and consists of a sequence of codons (nucleotide triplets) other than the start or stop codons. The gene is the substring in between the start and stop codons.
  14. Repeat finder. Write a program Repeat.java that takes two command line arguments, and finds the maximum number of repeats of the first command line argument in the file specified by the second command line argument.
  15. Character filter. Given a string t of bad characters, e.g. t = "!@#$%^&*()-_=+", write a function to read in another string s and return the result of removing all of the bad characters.

    String pattern = "[" + t + "]";
    String result  = s.replaceAll(pattern, "");
    
  16. Wildcard pattern matcher. Without using Java's built in regular expressions, write a program Wildcard.java to find all words in the dictionary matching a given pattern. The special symbol * matches any zero or more characters. So, for example the pattern "w*ard" matches the word "ward" and "wildcard". The special symbol . matches any one character. Your program should read the pattern as a command line parameter and the list of words (separated by whitespace) from standard input.
  17. Wildcard pattern matcher. Repeat the previous exercise, but this time use Java's built in regular expressions. Warning: in the context of wildcards, * has a different meaning than with regular expressions.
  18. Search and replace. Word processors allow you to search for all occurrences of a given query string and replace each with another replacement string. Write a program SearchAndReplace.java that takes two strings as command line inputs, reads in data from standard input, and replaces all occurrences of the first string with the second string, and sends the results to standard output. Hint: use the method String.replaceAll.
  19. Password validator. Suppose that for security reasons you require all passwords to have at least one of the following characters
    ~ ! @ # $ % ^ & * | 
    
    Write a regular expression for use with String.matches that returns true if and only if the password contains one of the required characters. Answer: "^[^~!@#$%^&*|]+$"
  20. Alphanumeric filter. Write a program Filter.java to read in text from standard input and eliminate all characters that are not whitespace or alpha-numeric. Answer here's the key line.
    String output = input.replaceAll("[^\\s0-9a-zA-Z]", "");
    
  21. Converting tabs to spaces. Write a program to convert all tabs in a Java source file to 4 spaces.
  22. Parsing delimited text files. A popular way to store a database is in a text file with one record per line, and each field separated by a special character called the delimiter.
    19072/Narberth/PA/Pennsylvania
    08540/Princeton/NJ/New Jersey
    
    Write a program Tokenizer.java that reads in two command line parameters, a delimiter character and the name of the file, and creates an array of tokens.
  23. Parsing delimited text files. Repeat the previous exercise, but use the String library method split().
  24. PROSITE to Java regular expression. Write a program to read in a PROSITE pattern and print out the corresponding Java regular expression.
  25. Checking a file format.
  26. Misspellings. Write a Java program to verify that this list of common misspellings adapted from Wikipedia contains only lines of the form
    misdemenors (misdemeanors)
    mispelling (misspelling)
    tennisplayer (tennis player)
    

    where the first word is the misspelling and the string in parentheses is a possible replacement.

  27. interesting English words