Regular Expression Primer
Slide 1: Regular expressions
Slide 2: What are they?
Regular expressions are a way to describe patterns in text.
Comments
- The regular expression shown in the previous slide matches valid e-mail addresses. Hopefully, by the end of the talk, what it does will be clearer.
Slide 3: When to use them
- Finding things
- Replacing things:
c("123", "123x", "abc123")
- Extracting data:
"Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
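The three uses above can be sketched in R with the example inputs from this slide. The patterns are illustrative assumptions, not the ones from the talk, and "[0-9]+" relies on a character class, which the talk only reaches later.

```r
# Example inputs taken from the slide.
ids <- c("123", "123x", "abc123")
sentence <- "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."

# Finding: which elements contain an "x"?
found <- grep("x", ids)              # 2

# Replacing: swap the literal text "123" for "#" in every element.
replaced <- gsub("123", "#", ids)    # "#" "#x" "abc#"

# Extracting: pull each run of digits out of the sentence.
numbers <- regmatches(sentence, gregexpr("[0-9]+", sentence))[[1]]  # "5" "456"
```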
Slide 4: Barriers to using regular expressions
Comments
- They're scary! The above regular expression was written by Thomas in order to parse court case documents.
Slide 5: Advice
Comments
- All regular expressions are made up of the same building blocks.
Slide 6: Things you (might) need to know
There are 3 main regular expression standards:
- POSIX Basic Regular Expressions
- POSIX Extended Regular Expressions (R uses this by default)
- Perl-based Regular Expressions (R has support for this as well)
Comments
- Different programming languages support different standards. R supports ERE by default, but it also supports Perl-based RE when you pass perl=TRUE.
- There is a caveat when using regular expressions in R, but that will appear later on in this talk.
- If you use regular expressions in a language and find that they don't behave as you expect, it may be because the language doesn't support certain features.
- Examples in this talk will be using R for demonstration.
- See http://en.wikipedia.org/wiki/Regular_expression#Syntax for more information
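As a small sketch of the perl=TRUE switch mentioned above: the lookahead syntax "(?=...)" is a Perl-style feature that is not part of POSIX ERE, so this example asks for the Perl engine explicitly. The input strings are made up for illustration.

```r
x <- c("foobar", "foobaz")

# Plain patterns behave the same under the default engine.
has_foo <- grepl("foo", x)                                 # TRUE TRUE

# "foo" only when it is immediately followed by "bar" (Perl-style lookahead).
has_foo_before_bar <- grepl("foo(?=bar)", x, perl = TRUE)  # TRUE FALSE
```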
Slide 7: The simplest regular expression
Comments
- The R function grep can be used to find matches in a character vector. It returns the indices of matching strings.
- "test" is a pattern, but not a very complicated one. How do you describe more complex patterns?
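A minimal sketch of grep with the pattern "test". The input vector here is an assumption made up for illustration, not the one from the original slide.

```r
words <- c("test", "a test", "toast", "testing")

# grep returns the indices of the elements that contain the pattern.
idx <- grep("test", words)                   # 1 2 4

# value = TRUE returns the matching strings instead of their indices.
vals <- grep("test", words, value = TRUE)    # "test" "a test" "testing"

# grepl returns one TRUE/FALSE per element.
flags <- grepl("test", words)                # TRUE TRUE FALSE TRUE
```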
Slide 8: Metacharacters
- Characters that have special meaning
- Meta, meaning "about"
Comments
- Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.
Slide 9: Your first metacharacter: Dot
- Dot (or period) matches any character
> grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2
- "foo.bar" is the regular expression
Comments
- You'll notice that dot matches a space
- What happens when you want to match two wildcard characters?
Slide 10: Matching two wildcards
> grep("foo..bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 4
Comments
- But what if you want to match three wildcard characters? There's an easier way.
Slide 11: Variable length matching, using "plus"
> grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 4
- "foo.+bar" is the regular expression
- The plus metacharacter means: "one or more times"
Comments
- You can use the plus metacharacter on other characters, too.
Slide 12: More "plus" usage
> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3
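To see what "bar+" actually matched, one option is to extract the matched text with regexpr and regmatches. This is a sketch rather than part of the original slide; it shows that the plus applies only to the "r" and takes as many of them as it can.

```r
x <- c("bar", "barr", "barrrrrrrr", "ba")

# regexpr finds the first match in each element (-1 where there is none).
m <- regexpr("bar+", x)

# regmatches returns the matched text and drops elements without a match.
matched <- regmatches(x, m)    # "bar" "barr" "barrrrrrrr"
```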
Slide 13: Quantifiers
- Metacharacters that describe "how many"
- The "plus" metacharacter is a quantifier
Comments
- There are a couple of other quantifiers you can use
Slide 14: Star metacharacter: "Zero or more times"
> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 3 4
> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4
Comments
- You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.
Slide 15: Question mark metacharacter: "Zero or one times"
> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2
Comments
- The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.
Slide 16: Brace metacharacter: "Exactly n times"
> grep("10{6}", c("1000", "1000000"))
[1] 2
Slide 17: Brace metacharacter improved
> grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001"))
[1] 2 3 4
- {n,m}: match at least n times, but not more than m times
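The quantifiers covered so far can be compared side by side on one set of inputs. This is a sketch; the strings and the names given to the patterns are made up for illustration.

```r
x <- c("ac", "abc", "abbc", "abbbc")

patterns <- c(
  plus     = "ab+c",      # one or more "b"
  star     = "ab*c",      # zero or more "b"
  optional = "ab?c",      # zero or one "b"
  exactly2 = "ab{2}c",    # exactly two "b"
  range    = "ab{1,2}c"   # one or two "b"
)

# One column per pattern, one row per input string.
m <- sapply(patterns, function(p) grepl(p, x))

# plus:     FALSE TRUE  TRUE  TRUE
# star:     TRUE  TRUE  TRUE  TRUE
# optional: TRUE  TRUE  FALSE FALSE
# exactly2: FALSE FALSE TRUE  FALSE
# range:    FALSE TRUE  TRUE  FALSE
```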
Slide 18: Intermission
Questions?
Slide 19: Anchors!
> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3
- The third string matches!
Comments
- The third string matches because this regular expression is "floating". It matches anywhere in the string.
- Solution: use an anchor.
Slide 20: Anchor to the end with a dollar sign
Comments
- The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
- Anchor metacharacters are special because they only specify string positions.
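A sketch of the dollar-sign anchor, reusing the inputs from the previous slide. The anchored call is reconstructed from the comments above rather than copied from the original slide.

```r
x <- c("100", "1000", "10000")

# Floating: "1000" may appear anywhere in the string.
floating <- grep("10{3}", x)     # 2 3

# Anchored: the three zeros must be the last characters of the string.
anchored <- grep("10{3}$", x)    # 2

# An anchor matches a position, not a character, so replacing "$"
# inserts text at the end without removing anything.
suffixed <- sub("$", "!", "abc") # "abc!"
```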
Slide 21: Anchor to the beginning with a caret
> grep("10{3}", c("1000", "abc1000"))
[1] 1 2
Comments
- The caret "anchors" the regular expression to the beginning of the string.
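A sketch of the same inputs with and without the caret. The anchored call is an assumption based on the comment above, since the slide's example shows only the floating pattern.

```r
x <- c("1000", "abc1000")

# Floating: both strings contain "1000" somewhere.
floating <- grep("10{3}", x)     # 1 2

# Anchored: the string must start with "1000".
anchored <- grep("^10{3}", x)    # 1
```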
Slide 22: Using both anchors
> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3
> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2
> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2
Slide 23: Character classes
--
JeremyStephens - 26 Sep 2011