
Regular Expression Primer

Slide 1: Regular expressions
- Comments
Slide 2: What are they?
- Comments
Slide 3: When to use them
Slide 4: Barriers to using regular expressions
- Comments
Slide 5: Advice
- Comments
Slide 6: Things you (might) need to know
- Comments
Slide 7: The simplest regular expression
- Comments
Slide 8: Metacharacters
- Comments
Slide 9: Your first metacharacter: Dot
- Comments
Slide 10: Matching two wildcards
- Comments
Slide 11: Variable length matching, using "plus"
- Comments
Slide 12: More "plus" usage
Slide 13: Quantifiers
- Comments
Slide 14: Star metacharacter: "Zero or more times"
- Comments
Slide 15: Question mark metacharacter: "Zero or one times"
- Comments
Slide 16: Brace metacharacter: "Exactly n times"
Slide 17: Brace metacharacter improved
Slide 18: Intermission
Slide 19: Anchors!
- Comments
Slide 20: Anchor to the end with a dollar sign
- Comments
Slide 21: Anchor to the beginning with a caret
- Comments
Slide 22: Using both anchors
Slide 23: Character classes with brackets
Slide 24: Character class ranges
Slide 25: Multiple ranges
Slide 26: Mixing ranges and characters
Slide 27: Context matters
Slide 28: Context in character classes
Slide 29: Full circle
- Comments
Slide 30: Better e-mail regular expression
- Comments


Slide 1: Regular expressions

JeremyStephens - Thursday, September 29, 2011

Comments

Text-version: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

Slide 2: What are they?

Regular expressions are a way to describe patterns in text.

Comments

The regular expression shown on the previous slide is meant to match valid e-mail addresses. Hopefully, by the end of the talk, what it does will be clearer.

Slide 3: When to use them

Finding things
Replacing things: c("123", "123x", "abc123")
Extracting data:
"Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
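
The three uses listed on this slide can be sketched in R as follows. This is only an illustration, not the example the slide originally showed: the digit-only check, the decision to strip non-digits, and all variable names are assumptions, and the patterns rely on anchors and character classes that are introduced on later slides.

```r
x <- c("123", "123x", "abc123")

# Finding: which strings consist of digits only?
digits_only <- grepl("^[0-9]+$", x)

# Replacing: delete every character that is not a digit
cleaned <- gsub("[^0-9]", "", x)

# Extracting: pull every run of digits out of a sentence
sentence <- "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
numbers <- regmatches(sentence, gregexpr("[0-9]+", sentence))[[1]]

print(digits_only)  # TRUE FALSE FALSE
print(cleaned)      # "123" "123" "123"
print(numbers)      # "5" "456"
```

grepl, gsub, gregexpr and regmatches are all base R functions, so no packages are needed.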

Slide 4: Barriers to using regular expressions

Photo: Carole Pasquier; Regular expression: Thomas Dupont

Comments

They're scary! The above regular expression was written by Thomas in order to parse court case documents.
Some good advice...

Slide 5: Advice

GFX credit: Seddon

Comments

All regular expressions are made up of the same building blocks.

Slide 6: Things you (might) need to know

There are three main regular expression standards:

POSIX Basic Regular Expressions
POSIX Extended Regular Expressions (R uses this by default)
Perl-based Regular Expressions (R has support for this as well)

Comments

Different programming languages support different standards. R supports ERE by default, but it also supports Perl-based regular expressions when you pass perl=TRUE.
There is a caveat when using regular expressions in R, but that will appear later on in this talk.
If you use regular expressions in a language and find that they don't behave as you expect, it may be because the language doesn't support certain features.
The examples in this talk use R.
See http://en.wikipedia.org/wiki/Regular_expression#Syntax for more information
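
As a rough sketch of what perl=TRUE changes, the call below uses a lookahead, which is Perl-style syntax. Lookahead is not covered in this talk. It is chosen here only as an assumed example of a feature that should not be expected to work with the default engine.

```r
x <- c("foobar", "foobaz")

# Lookahead "(?=...)" is Perl syntax: match "foo" only when "bar" follows it,
# without making "bar" part of the match.
followed_by_bar <- grepl("foo(?=bar)", x, perl = TRUE)

print(followed_by_bar)  # TRUE FALSE
```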

Slide 7: The simplest regular expression

Plain text!

> grep("test", c("foo", "bar", "test"))
[1] 3

"test" is a perfectly valid regular expression
Works just like CTRL+F in your browser
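
The grep call above reports positions rather than the strings themselves. The sketch below shows two closely related base R variants that return the matching strings or a logical vector instead. The variable names are only illustrative.

```r
x <- c("foo", "bar", "test")

idx     <- grep("test", x)                # positions of matching elements
values  <- grep("test", x, value = TRUE)  # the matching strings themselves
matches <- grepl("test", x)               # one TRUE/FALSE per element

print(idx)      # 3
print(values)   # "test"
print(matches)  # FALSE FALSE TRUE
```

The logical form is often the most convenient one for subsetting, for example x[grepl("test", x)].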

Comments

The R function grep finds matches in a character vector and returns the indices of the matching strings.
You may have already used regular expressions without knowing it!
"test" is a pattern, but not a very complicated one. How do you describe more complex patterns?

Slide 8: Metacharacters

Characters that have special meaning
Meta, meaning "about"

Comments

Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.

Slide 9: Your first metacharacter: Dot

Dot (or period) matches any character

> grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2

"foo.bar" is the regular expression

Comments

Notice that dot also matches a space.
What happens when you want to match two wildcard characters?

Slide 10: Matching two wildcards

> grep("foo..bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 4

Comments

But what if you want to match three wildcard characters? There's an easier way.

Slide 11: Variable length matching, using "plus"

> grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 4

"foo.+bar" is the regular expression
The plus metacharacter means: "one or more times"

Comments

You can use the plus metacharacter on other characters, too.

Slide 12: More "plus" usage

> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3

Slide 13: Quantifiers

Metacharacters that describe "how many"
The "plus" metacharacter is a quantifier

Comments

There are a few other quantifiers you can use.

Slide 14: Star metacharacter: "Zero or more times"

> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 3 4

> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4

Comments

You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.

Slide 15: Question mark metacharacter: "Zero or one times"

> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2

Comments

The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.

Slide 16: Brace metacharacter: "Exactly n times"

> grep("10{6}", c("1000", "1000000"))
[1] 2

Slide 17: Brace metacharacter improved

> grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001"))
[1] 2 3 4

{n,m} - Match at least n times, but not more than m times (write it without a space after the comma)
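
To compare the quantifiers seen so far side by side, the sketch below asks which part of a single string each pattern actually matches. The helper matched() and the test string are made up for this illustration. regexpr and regmatches are base R functions.

```r
x <- "barrr"

# Return the part of x that the pattern actually matched
matched <- function(pattern) regmatches(x, regexpr(pattern, x))

plus_match     <- matched("bar+")      # one or more "r"
star_match     <- matched("bar*")      # zero or more "r"
question_match <- matched("bar?")      # zero or one "r"
exact_match    <- matched("bar{2}")    # exactly two "r"
range_match    <- matched("bar{1,2}")  # one or two "r"

print(c(plus_match, star_match, question_match, exact_match, range_match))
# "barrr" "barrr" "bar" "barr" "barr"
```

Each quantifier takes as many repetitions as it is allowed to, which is why plus and star both consume all three "r" characters here.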

Slide 18: Intermission

Questions?

Slide 19: Anchors!

> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3

The third string matches!

Comments

The third string matches because this regular expression is "floating": it can match anywhere in the string, and "10000" contains "1000".
Solution: use an anchor.
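
One way to see the floating match is to ask R where the match starts and how long it is. This is a small sketch using the base R functions regexpr and regmatches. The variable names are illustrative.

```r
x <- c("100", "1000", "10000")

hits <- regexpr("10{3}", x)

start   <- as.integer(hits)            # start position, -1 means no match
lengths <- attr(hits, "match.length")  # number of characters matched
found   <- regmatches(x, hits)         # matched text, for matching elements only

print(start)    # -1 1 1
print(lengths)  # -1 4 4
print(found)    # "1000" "1000"
```

In "10000" only the first four characters are matched. The trailing zero is simply ignored, which is why the string still counts as a match.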

Slide 20: Anchor to the end with a dollar sign

Dollar sign metacharacter:

> grep("10{3}$", c("100", "1000", "10000"))
[1] 2

Comments

The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
Anchor metacharacters are special because they match a position in the string rather than a character.

Slide 21: Anchor to the beginning with a caret

> grep("10{3}", c("1000", "abc1000"))
[1] 1 2

Caret metacharacter:

> grep("^10{3}", c("1000", "abc1000"))
[1] 1

Comments

The caret "anchors" the regular expression to the beginning of the string.

Slide 22: Using both anchors

> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3

> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2

> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2

Slide 23: Character classes with brackets

When you want to match only certain characters rather than any character (as dot does):

> grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg")) 
[1] 1 2 3 4

Bracket metacharacter

Slide 24: Character class ranges

Instead of typing out all of those letters:

> grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg")) 
[1] 1 2 3 4

Dash metacharacter

Slide 25: Multiple ranges

> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg")) 
[1] 1 2 3 4

Slide 26: Mixing ranges and characters

> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2")) 
[1] 1 2 3 4
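
Ranges become more useful when combined with the quantifiers and anchors from earlier slides. As a hedged example, the sketch below checks for six-character hexadecimal strings. The use case and the variable names are assumptions made for illustration and are not part of the talk.

```r
x <- c("ff00aa", "FF00AA", "gg0000", "ff00a")

# Exactly six characters, each a digit or a letter a-f in either case
is_hex6 <- grepl("^[0-9a-fA-F]{6}$", x)

print(is_hex6)  # TRUE TRUE FALSE FALSE
```

The third string fails because "g" is outside both letter ranges, and the fourth fails because it has only five characters.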

Slide 27: Context matters

Because regular expressions aren't hard enough
Some metacharacters mean different things depending on context

Slide 28: Context in character classes

Most metacharacters lose their special meaning inside brackets, and a dash placed last in a class is a literal dash:

> grep("[.+*]", c(".", "+", "*", "x"))
[1] 1 2 3
> grep("[a-c-]", c("a", "b", "c", "-"))
[1] 1 2 3 4
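
The caret is another metacharacter whose meaning depends on context. The slides do not show this case, so treat the following as a supplementary sketch: placed first inside brackets the caret negates the class, placed elsewhere inside brackets it is a literal character, and outside brackets it is the start-of-string anchor.

```r
x <- c("abc", "abd", "abg", "ab^")

negated  <- grep("ab[^c-f]", x)  # caret first in a class: anything except c-f
literal  <- grep("ab[c^]", x)    # caret later in a class: a literal "^"
anchored <- grep("^ab", x)       # caret outside brackets: start-of-string anchor

print(negated)   # 3 4
print(literal)   # 1 4
print(anchored)  # 1 2 3 4
```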

Slide 29: Full circle

Comments

Now, hopefully this is less scary.
Let's go through this step by step.

Slide 30: Better e-mail regular expression

Comments

Complete with Titans colors!
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
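
To go through the expression step by step, the sketch below assembles it from its parts and tries it on a few made-up addresses. Inside an R string literal the backslash has to be doubled, so \. is written as "\\.". The variable names and test addresses are illustrative assumptions.

```r
email_pattern <- paste0(
  "^",                  # start of string
  "[A-Za-z0-9._%+-]+",  # local part: one or more allowed characters
  "@",                  # literal at sign
  "[A-Za-z0-9.-]+",     # domain name
  "\\.",                # literal dot (escaped, so it is not a wildcard)
  "[A-Za-z]{2,4}",      # top-level domain of 2 to 4 letters
  "$"                   # end of string
)

addresses <- c("jane.doe@example.com", "JANE@EXAMPLE.ORG",
               "no-at-sign.example.com", "jane@example", "jane@example.museum")

is_email <- grepl(email_pattern, addresses)
print(is_email)  # TRUE TRUE FALSE FALSE FALSE

# The upper-case-only version from slide 1 needs ignore.case = TRUE
upper_only <- "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}$"
strict  <- grepl(upper_only, "jane.doe@example.com")
relaxed <- grepl(upper_only, "jane.doe@example.com", ignore.case = TRUE)
print(c(strict, relaxed))  # FALSE TRUE
```

The last address is rejected only because its top-level domain is longer than four letters, so the pattern is a practical approximation rather than a complete validity check.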

-- JeremyStephens - 26 Sep 2011

Attachment	Size	Date	Who	Comment
Dont_panic.jpg	635 K	28 Sep 2011 - 15:58	JeremyStephens	Don't panic! (From Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Dont_panic.jpg)
browser-find.png	6 K	28 Sep 2011 - 15:32	JeremyStephens	Find from Firefox
email-regexp-2.png	50 K	29 Sep 2011 - 12:16	JeremyStephens	Better e-mail regular expression
email-regexp.png	45 K	26 Sep 2011 - 17:31	JeremyStephens	E-mail regular expression
office-sign-brain.jpg	70 K	29 Sep 2011 - 11:12	JeremyStephens	Via http://www.happyworker.com/magazine/fun/april-fools-day-in-the-stressed-office
scary-regexp.jpg	409 K	26 Sep 2011 - 17:31	JeremyStephens	Scary regular expression

Topic revision: r7 - 18 Dec 2015, JeremyStephens