Regular Expression Primer

Regular expressions are a way to describe patterns in text. You can use them to search, replace, and extract information from character data. All of the examples below are in R but can be modified to work in just about any other programming language.

Basic examples

  • To find stuff
    haystack <- c("abcdef", "anbeceddelfe", "abcdef")
    grep("n.e.e.d.l.e", haystack) #=> [1] 2
    
  • To remove cruft
    x <- c("123", "123 oz", "123 ounces")
    sub("\\D+$", "", x) #=> [1] "123" "123" "123"
    
  • To extract data
    x <- "The Predators lost in overtime with 4:43 left on the clock."
    sub("^.+(\\d{1,2}:\\d{2}).+$", "\\1", x) #=> [1] "4:43"
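
The extraction example above works by replacing the whole string with the captured group. An alternative sketch uses base R's regexpr() and regmatches() to pull out the matched text directly. The second string, y, is a made-up input added to show where the two approaches differ.

```r
x <- "The Predators lost in overtime with 4:43 left on the clock."

# regexpr() finds the first match; regmatches() returns the matched text.
clock <- regmatches(x, regexpr("\\d{1,2}:\\d{2}", x))
clock #=> [1] "4:43"

# With a two-digit minute, the greedy ".+" in the substitution approach
# leaves only one digit for the group, while regmatches() keeps both.
y <- "There was 14:43 left on the clock."  # hypothetical input
greedy <- sub("^.+(\\d{1,2}:\\d{2}).+$", "\\1", y)
whole <- regmatches(y, regexpr("\\d{1,2}:\\d{2}", y))
greedy #=> [1] "4:43"
whole  #=> [1] "14:43"
```

If the pattern might not match at all, note that regmatches() returns a zero-length result, whereas the substitution approach returns the input unchanged.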
    

Metacharacters

Metacharacters are special characters that describe patterns.

  • Dot - match any one character: grep("foo.bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 2
  • Plus - match the preceding item one or more times: grep("foo.+bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 2 3
  • Asterisk - match the preceding item zero or more times: grep("foo.*bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 1 2 3
  • Question mark - match the preceding item zero or one time: grep("foo.?bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 1 2
  • Curly braces - match the preceding item exactly n times: grep("foo.{2}bar", c("foobar", "fooxbar", "fooxxbar", "fooxxxbar")) #=> [1] 3
  • Curly braces - match the preceding item between n and m times: grep("foo.{1,2}bar", c("foobar", "fooxbar", "fooxxbar", "fooxxxbar")) #=> [1] 2 3

Quantifiers are metacharacters that specify how many times the preceding item may repeat. They are the plus, asterisk, question mark, and curly braces.
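
The quantifier examples above all follow a dot, but a quantifier applies to whatever item comes immediately before it. A small sketch with a literal letter instead of a dot (the word list is made up for illustration):

```r
words <- c("color", "colour", "colouur")

# Each quantifier applies only to the "u" that precedes it.
zero_or_one  <- grepl("colou?r", words)
one_or_more  <- grepl("colou+r", words)
zero_or_more <- grepl("colou*r", words)
exactly_two  <- grepl("colou{2}r", words)

zero_or_one  #=> [1]  TRUE  TRUE FALSE
one_or_more  #=> [1] FALSE  TRUE  TRUE
zero_or_more #=> [1]  TRUE  TRUE  TRUE
exactly_two  #=> [1] FALSE FALSE  TRUE
```

grepl() is used here instead of grep() so that each word gets its own TRUE or FALSE.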

Anchors

Anchors let you specify that the pattern starts at the beginning of the string, or ends at the end of the string. Without them, the pattern can match anywhere.

  • Caret - anchor at the beginning: grep("^foo", c("foo", "foobar", "barfoo")) #=> [1] 1 2
  • Dollar sign - anchor at the end: grep("foo$", c("foo", "foobar", "barfoo")) #=> [1] 1 3
  • Both - anchor at both ends: grep("^foo$", c("foo", "foobar", "barfoo")) #=> [1] 1
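
Anchors matter just as much when replacing as when searching. The following sketch strips "foo" from the start, from the end, and from wherever it first occurs; the input vector is made up for illustration.

```r
x <- c("foobar", "barfoo", "foofoo")

strip_start <- sub("^foo", "", x)  # only a leading "foo" is removed
strip_end   <- sub("foo$", "", x)  # only a trailing "foo" is removed
strip_first <- sub("foo", "", x)   # the first "foo" found anywhere is removed

strip_start #=> [1] "bar"    "barfoo" "foo"
strip_end   #=> [1] "foobar" "bar"    "foo"
strip_first #=> [1] "bar" "bar" "foo"
```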

Character classes

You can match any one character from a set by listing the set inside brackets. A hyphen between two characters denotes a range.

grep("[abcdef]", c("a", "d", "g")) #=> [1] 1 2
grep("[a-f]", c("a", "d", "g")) #=> [1] 1 2
grep("[a-ce-g]", c("a", "d", "g")) #=> [1] 1 3

You can negate a character class with a caret inside the brackets.

grep("[^abcdef]", c("a", "d", "g")) #=> [1] 3
grep("[^a-f]", c("a", "d", "g")) #=> [1] 3
grep("[^a-ce-g]", c("a", "d", "g")) #=> [1] 2
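
Character classes combine naturally with quantifiers and anchors. As a sketch, the following checks whether strings consist entirely of hexadecimal digits; the input vector is made up, and the ranges are assumed to behave as in a typical English locale.

```r
codes <- c("ff00aa", "FF00AA", "gg00aa", "12 34")

# TRUE only when every character is a hexadecimal digit
all_hex <- grepl("^[0-9a-fA-F]+$", codes)

# TRUE when at least one character is NOT a hexadecimal digit
has_other <- grepl("[^0-9a-fA-F]", codes)

all_hex   #=> [1]  TRUE  TRUE FALSE FALSE
has_other #=> [1] FALSE FALSE  TRUE  TRUE
```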

There are several "built-in" character classes:

Shortcut  Expanded
\d        [0-9]
\w        [a-zA-Z0-9_]
\s        [ \t\n\r\f]
\D        [^0-9]
\W        [^a-zA-Z0-9_]
\S        [^ \t\n\r\f]

In R, a backslash inside a string literal must itself be escaped, so write the built-in character classes with a double backslash: "\\d" rather than "\d".

grep("\d+", c("123", "abc")) #=> Error!
grep("\\d+", c("123", "abc")) #=> [1] 1
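
To see why the double backslash is needed, it can help to look at what the string actually contains. The sketch below also shows raw string literals, which avoid the doubling but are only available in R 4.0.0 or later (an assumption about your R version).

```r
pattern <- "\\d+"

# The string holds three characters: a backslash, "d", and "+".
n_chars <- nchar(pattern)
n_chars #=> [1] 3
cat(pattern, "\n")  # prints \d+

# Raw strings (R >= 4.0.0) take backslashes literally.
raw_pattern <- r"(\d+)"
same <- identical(pattern, raw_pattern)
same #=> [1] TRUE

hits <- grep(raw_pattern, c("123", "abc"))
hits #=> [1] 1
```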

Grouping

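You can use parentheses for grouping. Each group captures the text it matched, and you can refer to that text later, in the replacement or in the pattern itself, using \1, \2, \3, etc. (written "\\1", "\\2", "\\3" in an R string).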

sub("foo(.+)", "\\1", c("foobar", "foo123", "foo!@#")) #=> [1] "bar" "123" "!@#"
grep("foo(.+)_\\1", c("foobar_bar", "foo123_123")) #=> [1] 1 2
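
Groups are numbered from left to right by their opening parenthesis, so several can be used in one replacement. A short sketch with made-up inputs: the first call reorders two captured words, and the second uses a backreference inside the pattern to detect a doubled character.

```r
people <- c("Stephens, Jeremy", "Lovelace, Ada")

# \\1 is the surname, \\2 is the given name; swap them.
swapped <- sub("^(\\w+), (\\w+)$", "\\2 \\1", people)
swapped #=> [1] "Jeremy Stephens" "Ada Lovelace"

# (.)\\1 matches any character followed by the same character again.
doubled <- grepl("(.)\\1", c("letter", "later"))
doubled #=> [1]  TRUE FALSE
```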

More information

-- JeremyStephens - 18 Dec 2015
Topic revision: r3 - 22 Dec 2015, JeremyStephens
 
