
Regular Expression Primer

Slide 1: Regular expressions

[Image: the e-mail regular expression (email-regexp.png)]
Stolen from Jeremy Stephens - December 5, 2011

Comments

Text-version: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

Slide 2: What are they?

Regular expressions are a way to describe patterns in text.

Comments

The regular expression shown on the previous slide is meant to match valid e-mail addresses. By the end of the talk, how it works should be clearer.

Slide 3: When to use them

Finding things
Replacing things: c("123", "123x", "abc123")
Extracting data:

 
schriro,1,02/02/09,2009,3,0,0,0,1,1,
0,0,1,0,9,3,"o'scannlain, graber, mckeown",
4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,784,2,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,lauren,nf
2006 WL 2865064,berry v. epps,0,
10/05/06,2006,24,0,0,0,1,1,0,1,0,0,0,1,
davidson,4,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,1,0,0,0,
0,0,1,1,0,0,0,0,0,0,0,"*23-*24, *33-*35",
2,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,
0,0,0,0,1,0,1,0,0,0,0,1,0,0,A.J.,nf
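
The three uses above can be sketched in Python, the second language used in this lecture. This is only an illustration under assumptions: the replacement rule (delete lowercase letters) and the choice to extract the date field are made up for the example, and the line of data is the start of the sample above.

```python
import re

values = ["123", "123x", "abc123"]

# Finding: does the string contain a lowercase letter anywhere?
has_letter = [re.search("[a-z]", v) is not None for v in values]

# Replacing: delete every lowercase letter (an assumed cleanup rule).
digits_only = [re.sub("[a-z]", "", v) for v in values]

# Extracting: pull the date out of the start of the first sample record.
line = "schriro,1,02/02/09,2009,3,0,0,0,1,1,"
dates = re.findall("[0-9]{2}/[0-9]{2}/[0-9]{2}", line)

print(has_letter)   # [False, True, True]
print(digits_only)  # ['123', '123', '123']
print(dates)        # ['02/02/09']
```

The bracket and brace notation used here is explained in later slides.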

Slide 4: Things you (might) need to know

There are 3 main regular expression standards:

POSIX Basic Regular Expressions
POSIX Extended Regular Expressions (R uses this by default)
Perl-based Regular Expressions (R has support for this as well)

Comments

Different programming languages support different standards. R uses ERE by default, but it also supports Perl-based RE (pass perl=TRUE).
There is a caveat when using regular expressions in R, but that will appear later on in this talk.
If you use regular expressions in a language and find that it doesn't behave as you expect, it may be because the language doesn't support certain features.
Examples in this talk will be using R for demonstration.
See http://en.wikipedia.org/wiki/Regular_expression#Syntax for more information

Slide 5: The simplest regular expression

Plain text!

> grep("test", c("foo", "bar", "test"))
[1] 3

"test" is a perfectly valid regular expression
Works just like CTRL+F in your browser

In Python:

>>> import re
>>> words = "foo", "bar", "test"
>>> [re.search("test", word) is not None for word in words]
[False, False, True]

Comments

The R function grep can be used to find matches in a character vector. It returns the indices of the matching strings.
You may have already used regular expressions without knowing it!
"test" is a pattern, but not a very complicated one. How do you describe more complex patterns?

Slide 6: Metacharacters

Characters that have special meaning
Meta, meaning "about"

Comments

Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.

Slide 7: Your first metacharacter: Dot

Dot (or period) matches any single character

> grep("foo.bar", c("fooxbar", "foo bar", "foobar", 
      "foo12bar"))
[1] 1 2

"foo.bar" is the regular expression

Comments

You'll notice that dot matches a space
What happens when you want to match two wildcard characters?
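
For comparison, here is a rough Python equivalent of the dot example. It is a sketch, not a translation of R's grep: it collects the matching strings themselves rather than their indices.

```python
import re

candidates = ["fooxbar", "foo bar", "foobar", "foo12bar"]

# The dot stands for exactly one character, so "foobar" (zero characters
# between foo and bar) and "foo12bar" (two characters) do not match.
matches = [s for s in candidates if re.search("foo.bar", s)]
print(matches)  # ['fooxbar', 'foo bar']

# In Python, the dot does not match a newline unless re.DOTALL is passed.
newline_default = re.search("foo.bar", "foo\nbar") is not None
newline_dotall = re.search("foo.bar", "foo\nbar", re.DOTALL) is not None
print(newline_default, newline_dotall)  # False True
```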

Slide 8: Matching two wildcards

> grep("foo..bar", c("fooxbar", "foo bar", "foobar", 
       "foo12bar"))
[1] 4

Comments

But what if you want to match three wildcard characters? There's an easier way.

Slide 9: Variable length matching, using "plus"

> grep("foo.+bar", c("fooxbar", "foo bar", "foobar",
        "foo12bar"))
[1] 1 2 4

"foo.+bar" is the regular expression
The plus metacharacter means: "one or more times"

Comments

You can use the plus metacharacter on other characters, too.

Slide 10: More "plus" usage

> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3
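
It can help to see which part of each string the pattern actually matched. The Python sketch below does that for both "plus" examples; matched_text is a hypothetical helper written for this illustration, not a library function.

```python
import re

def matched_text(pattern, s):
    """Return the part of s matched by pattern, or None (hypothetical helper)."""
    m = re.search(pattern, s)
    return m.group() if m else None

# ".+" needs at least one character between foo and bar.
dot_plus = [matched_text("foo.+bar", s)
            for s in ["fooxbar", "foo bar", "foobar", "foo12bar"]]

# "r+" applies the plus only to the r, not to the whole word "bar".
r_plus = [matched_text("bar+", s)
          for s in ["bar", "barr", "barrrrrrrr", "ba"]]

print(dot_plus)  # ['fooxbar', 'foo bar', None, 'foo12bar']
print(r_plus)    # ['bar', 'barr', 'barrrrrrrr', None]
```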

Slide 11: Quantifiers

Metacharacters that describe "how many"
The "plus" metacharacter is a quantifier

Comments

There are a couple of other quantifiers you can use

Slide 12: Star metacharacter: "Zero or more times"

> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", 
       "foo12bar"))
[1] 1 2 3 4

> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4

Comments

You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.
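
The difference between plus and star can be checked side by side. A small Python sketch, using the same strings as the R example; the last lines show a pitfall that follows from "zero or more": a pattern made only of a starred item also matches the empty string.

```python
import re

strings = ["bar", "barr", "barrrr", "ba"]

# Plus needs at least one r after "ba"; star is satisfied with none.
plus_hits = [s for s in strings if re.search("bar+", s)]
star_hits = [s for s in strings if re.search("bar*", s)]

print(plus_hits)  # ['bar', 'barr', 'barrrr']
print(star_hits)  # ['bar', 'barr', 'barrrr', 'ba']

# Pitfall: "x*" matches zero x's, so it "matches" a string with no x at all.
empty = re.search("x*", "abc")
print(empty.group() == "")  # True
```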

Slide 13: Question mark metacharacter: "Zero or one times"

> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2

Comments

The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.
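
The claim about swapping the question mark for a star can be verified directly. A minimal Python sketch using the same three strings:

```python
import re

strings = ["abdef", "abcdef", "abccdef"]

# "c?" allows zero or one c; "c*" allows any number of c's.
with_question = [re.search("abc?def", s) is not None for s in strings]
with_star = [re.search("abc*def", s) is not None for s in strings]

print(with_question)  # [True, True, False]
print(with_star)      # [True, True, True]
```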

Slide 14: Brace metacharacter: "Exactly n times"

> grep("10{6}", c("1000", "1000000"))
[1] 2

Slide 15: Brace metacharacter improved

> grep("10{2,4}1", c("101", "1001", "10001", "100001", 
       "1000001"))
[1] 2 3 4

{n,m} - Match at least n times, but not more than m times (written without a space inside the braces)
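
The brace quantifiers behave the same way in Python. The sketch below repeats the example and adds one check that is specific to Python's re module: a space inside the braces stops them from being read as a quantifier, so the pattern is taken literally and matches none of these strings.

```python
import re

zeros = ["101", "1001", "10001", "100001", "1000001"]

# Exactly two zeros between the ones.
exactly_two = [s for s in zeros if re.search("10{2}1", s)]

# Between two and four zeros between the ones.
two_to_four = [s for s in zeros if re.search("10{2,4}1", s)]

# With a space, Python treats "{2, 4}" as literal text, not as a quantifier.
with_space = [s for s in zeros if re.search("10{2, 4}1", s)]

print(exactly_two)  # ['1001']
print(two_to_four)  # ['1001', '10001', '100001']
print(with_space)   # []
```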

Slide 16: Anchors!

> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3

The third string matches!

Comments

The third string matches because this regular expression is "floating". It matches anywhere in the string.
Solution: use an anchor.

Slide 17: Anchor to the end with a dollar sign

Dollar sign metacharacter:

> grep("10{3}$", c("100", "1000", "10000"))
[1] 2

Comments

The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
Anchor metacharacters are special because they match a position in the string rather than a character.

Slide 18: Anchor to the beginning with a caret

> grep("10{3}", c("1000", "abc1000"))
[1] 1 2

Caret metacharacter:

> grep("^10{3}", c("1000", "abc1000"))
[1] 1

Comments

The caret "anchors" the regular expression to the beginning of the string.

Slide 19: Using both anchors

> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3

> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2

> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2
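
The same three steps in Python, as a sketch. Two Python-specific points are added here and are not from the slides: re.fullmatch requires the whole string to match, so it needs no explicit anchors, and Python's dollar sign also matches just before a trailing newline.

```python
import re

strings = ["abc1000", "1000", "10000"]

floating = [s for s in strings if re.search("10{3}", s)]
end_anchored = [s for s in strings if re.search("10{3}$", s)]
both_anchors = [s for s in strings if re.search("^10{3}$", s)]

# fullmatch behaves like using both anchors.
whole_string = [s for s in strings if re.fullmatch("10{3}", s)]

print(floating)      # ['abc1000', '1000', '10000']
print(end_anchored)  # ['abc1000', '1000']
print(both_anchors)  # ['1000']
print(whole_string)  # ['1000']

# Python detail: "$" still matches when the string ends with a newline.
trailing_newline = re.search("^10{3}$", "1000\n") is not None
print(trailing_newline)  # True
```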

Slide 20: Character classes with brackets

In case you don't want to match any character by using dot:

> grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg")) 
[1] 1 2 3 4

Bracket metacharacter

Slide 21: Character class ranges

Instead of typing out all of those letters:

> grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg")) 
[1] 1 2 3 4

Dash metacharacter

Slide 22: Multiple ranges

> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg")) 
[1] 1 2 3 4

Slide 23: Mixing ranges and characters

> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2")) 
[1] 1 2 3 4
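
The character class examples from the last few slides can be compared on one list of strings. A Python sketch; the strings "abg" and "ab4" are included as non-matching cases.

```python
import re

strings = ["abc", "abC", "abf", "abF", "ab1", "abg", "ab4"]

# One range: lowercase c through f only.
lower = [s for s in strings if re.search("ab[c-f]", s)]

# Two ranges: c through f in either case.
either_case = [s for s in strings if re.search("ab[c-fC-F]", s)]

# Ranges mixed with the single characters 1, 2 and 3.
mixed = [s for s in strings if re.search("ab[c-fC-F123]", s)]

print(lower)        # ['abc', 'abf']
print(either_case)  # ['abc', 'abC', 'abf', 'abF']
print(mixed)        # ['abc', 'abC', 'abf', 'abF', 'ab1']
```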

Slide 24: Context matters

Because regular expressions aren't hard enough
Some metacharacters mean different things depending on context

Slide 25: Context in character classes

Most metacharacters lose their meaning inside brackets:

> grep("[.+*]", c(".", "+", "*", "x"))
[1] 1 2 3
> grep("[a-c-]", c("a", "b", "c", "-"))
[1] 1 2 3 4
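
To see the change in meaning, compare the same character inside and outside brackets. A short Python sketch:

```python
import re

# Outside brackets the dot matches any character; inside, only a real dot.
dot_outside = re.search(".", "x") is not None
dot_inside = re.search("[.]", "x") is not None
dot_inside_literal = re.search("[.]", "a.b") is not None

# A dash between two characters makes a range; a dash at the end is literal.
dash_in_range = re.search("[a-c]", "-") is not None
dash_at_end = re.search("[a-c-]", "-") is not None

print(dot_outside, dot_inside, dot_inside_literal)  # True False True
print(dash_in_range, dash_at_end)                   # False True
```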

Slide 26: Full circle

[Image: the e-mail regular expression from Slide 1 (email-regexp.png)]

Comments

Now, hopefully this is less scary.
Let's go through this step by step.

Slide 27: Better e-mail regular expression

[Image: improved e-mail regular expression (email-regexp-2.png)]

Comments

Complete with Titans colors!
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
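
One way to read the expression piece by piece is to split it into commented parts. The Python sketch below does that; the sample addresses are made up for illustration, and the backslash-dot (a literal dot) is the one piece of notation not covered in the earlier slides.

```python
import re

EMAIL = re.compile(
    r"^[A-Za-z0-9._%+-]+"  # start of string, then one or more name characters
    r"@"                   # a literal at sign
    r"[A-Za-z0-9.-]+"      # one or more domain characters
    r"\."                  # a literal dot (escaped, so not "any character")
    r"[A-Za-z]{2,4}$"      # two to four letters, then end of string
)

# Hypothetical test addresses.
samples = [
    "jane.doe@example.com",
    "JANE@EXAMPLE.ORG",
    "no-at-sign.example.com",
    "jane@example.museum",
]
valid = [s for s in samples if EMAIL.search(s)]
print(valid)  # ['jane.doe@example.com', 'JANE@EXAMPLE.ORG']

# The Slide 1 version only lists uppercase letters, so it rejects lowercase.
SLIDE_1 = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$")
lowercase_ok = SLIDE_1.search("jane.doe@example.com") is not None
print(lowercase_ok)  # False
```

The last sample shows a limit of this pattern: a top-level domain longer than four letters is rejected.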

-- ChrisFonnesbeck - 05 Dec 2011

Topic revision: r1 - 05 Dec 2011, ChrisFonnesbeck
