Regular Expression Primer

Regular expressions are a way to describe patterns in text. You can use them to search, replace, and extract information from character data. All of the examples below are in R but can be modified to work in just about any other programming language.

Basic examples
Metacharacters
Anchors
Character classes
Grouping
More information

Basic examples

To find stuff

haystack <- c("abcdef", "anbeceddelfe", "abcdef")
grep("n.e.e.d.l.e", haystack) #=> [1] 2

To remove cruft

x <- c("123", "123 oz", "123 ounces")
sub("\\D+$", "", x) #=> [1] "123" "123" "123"

To extract data

x <- "The Predators lost in overtime with 4:43 left on the clock."
sub("^.+(\\d{1,2}:\\d{2}).+$", "\\1", x) #=> [1] "4:43"
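
The extraction pattern above depends on how the greedy `.+` backtracks, so a two-digit minute value such as 12:43 would lose its first digit. One possible alternative, sketched here with base R's regexpr() and regmatches(), pulls out the matched text directly. The second input string and the variable names are made up for illustration.

```r
x <- c("The Predators lost in overtime with 4:43 left on the clock.",
       "The period ended with 12:05 remaining.")

# regexpr() locates the first match in each string;
# regmatches() returns the matched text itself.
m <- regexpr("\\d{1,2}:\\d{2}", x)
clock <- regmatches(x, m)
clock #=> [1] "4:43"  "12:05"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input.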

Metacharacters

Metacharacters are special characters that describe patterns.

Dot - match any one character: grep("foo.bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 2
Plus - match one or more times: grep("foo.+bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 2 3
Asterisk - match zero or more times: grep("foo.*bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 1 2 3
Question mark - match zero or one time: grep("foo.?bar", c("foobar", "fooxbar", "fooxxbar")) #=> [1] 1 2
Curly braces - match exactly n times: grep("foo.{2}bar", c("foobar", "fooxbar", "fooxxbar", "fooxxxbar")) #=> [1] 3
Curly braces - match between n and m times: grep("foo.{1,2}bar", c("foobar", "fooxbar", "fooxxbar", "fooxxxbar")) #=> [1] 2 3

Quantifiers are metacharacters that specify how many times the preceding element may repeat. They include the plus, asterisk, question mark, and curly braces.
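
A quantifier applies to the single element immediately before it, which does not have to be a dot. A small sketch with made-up input strings:

```r
words <- c("color", "colour", "colouur")

# "u?" makes only the "u" optional, so the first two spellings match.
optional_u <- grep("colou?r", words)
optional_u #=> [1] 1 2

# "u+" requires at least one "u", so "color" no longer matches.
repeated_u <- grep("colou+r", words)
repeated_u #=> [1] 2 3
```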

Anchors

Anchors let you specify that the pattern starts at the beginning of the string, or ends at the end of the string. Without them, the pattern can match anywhere.

Caret - anchor at the beginning: grep("^foo", c("foo", "foobar", "barfoo")) #=> [1] 1 2
Dollar sign - anchor at the end: grep("foo$", c("foo", "foobar", "barfoo")) #=> [1] 1 3
Both - anchor at both ends: grep("^foo$", c("foo", "foobar", "barfoo")) #=> [1] 1
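
Anchoring both ends is a common way to test a whole string rather than to find the pattern somewhere inside it. The sketch below uses grepl(), which returns TRUE or FALSE for each element instead of the matching indices; the vector is an illustrative extension of the examples above.

```r
x <- c("foo", "foobar", "barfoo", "barfoobar")

unanchored <- grepl("foo", x)    # pattern may occur anywhere
anchored   <- grepl("^foo$", x)  # pattern must span the whole string

x[unanchored] # all four strings
x[anchored]   # only "foo"
```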

Character classes

You can specify a set or a range of characters to match using brackets. A character class matches exactly one character from that set.

grep("[abcdef]", c("a", "d", "g")) #=> [1] 1 2
grep("[a-f]", c("a", "d", "g")) #=> [1] 1 2
grep("[a-ce-g]", c("a", "d", "g")) #=> [1] 1 3

You can negate a character class with a caret inside the brackets.

grep("[^abcdef]", c("a", "d", "g")) #=> [1] 3
grep("[^a-f]", c("a", "d", "g")) #=> [1] 3
grep("[^a-ce-g]", c("a", "d", "g")) #=> [1] 2
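
Character classes combine naturally with quantifiers and anchors. As a sketch, the following checks whether each string is a six-digit hexadecimal code; the `codes` vector is hypothetical.

```r
codes <- c("ff00aa", "FF00AA", "ff00zz", "ff00")

# Exactly six characters, each a digit or a letter a-f in either case.
is_hex <- grepl("^[0-9a-fA-F]{6}$", codes)
is_hex #=> [1]  TRUE  TRUE FALSE FALSE
```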

There are several "built-in" character classes:

Shortcut	Expanded
`\d`	`[0-9]`
`\w`	`[a-zA-Z0-9_]`
`\s`	`[ \t\n\r\f]`
`\D`	`[^0-9]`
`\W`	`[^a-zA-Z0-9_]`
`\S`	`[^ \t\n\r\f]`

In R, a backslash inside a string literal must itself be escaped, so write the built-in character classes with a double backslash (for example, \\d rather than \d).

grep("\d+", c("123", "abc")) #=> Error!
grep("\\d+", c("123", "abc")) #=> [1] 1
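
In R 4.0.0 and later, raw string literals are another way to avoid doubled backslashes. A brief sketch, assuming such a version of R is in use (the input vector is made up):

```r
x <- c("123", "abc", "a b")

# r"(...)" is a raw string: backslashes inside it are taken literally.
raw_hits <- grep(r"(\d+)", x)
raw_hits #=> [1] 1

# The raw string and the double-escaped string are the same pattern.
identical(r"(\d+)", "\\d+") #=> [1] TRUE

# Built-in classes work the same way in gsub().
underscored <- gsub("\\s", "_", x)
underscored #=> [1] "123" "abc" "a_b"
```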

Grouping

You can use parentheses for grouping. Groups can be referred to later on, in the pattern itself or in the replacement, using \1, \2, \3, etc. (written as \\1, \\2, \\3 inside an R string).

sub("foo(.+)", "\\1", c("foobar", "foo123", "foo!@#")) #=> [1] "bar" "123" "!@#"
grep("foo(.+)_\\1", c("foobar_bar", "foo123_123")) #=> [1] 1 2
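
Groups are numbered from left to right by their opening parenthesis, so a replacement can reorder them. A minimal sketch with hypothetical names:

```r
full_names <- c("Ada Lovelace", "Grace Hopper")

# Group 1 captures the first name and group 2 the last name;
# the replacement writes them back in reverse order.
reordered <- sub("^(\\w+) (\\w+)$", "\\2, \\1", full_names)
reordered #=> [1] "Lovelace, Ada" "Hopper, Grace"
```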

More information

-- JeremyStephens - 18 Dec 2015

Topic attachments
I	Attachment	Action	Size	Date	Who	Comment
pdf	regex-primer.pdf	manage	523.2 K	18 Dec 2015 - 10:40	JeremyStephens

Topic revision: r3 - 22 Dec 2015, JeremyStephens
