Regular Expression Primer
Slide 1: Regular expressions
Comments
- Text-version: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$
Slide 2: What are they?
Regular expressions are a way to describe patterns in text.
Comments
- The regular expression shown on the previous slide is meant to match valid e-mail addresses. Hopefully, by the end of the talk, what it does will be clearer.
Slide 3: When to use them
- Finding things
- Replacing things:
c("123", "123x", "abc123")
- Extracting data:
"Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
Slide 4: Barriers to using regular expressions
Comments
- They're scary! The above regular expression was written by Thomas in order to parse court case documents.
- Some good advice...
Slide 5: Advice
Comments
- All regular expressions are made up of the same building blocks.
Slide 6: Things you (might) need to know
There are 3 main regular expression standards:
- POSIX Basic Regular Expressions
- POSIX Extended Regular Expressions (R uses this by default)
- Perl-based Regular Expressions (R has support for this as well)
Comments
- Different programming languages support different standards. R uses ERE by default, but it also supports Perl-based RE when you pass perl=TRUE.
- There is a caveat when using regular expressions in R, but that will appear later on in this talk.
- If a regular expression doesn't behave as you expect in some language, it may be because that language's standard doesn't support the feature you are using.
- Examples in this talk will be using R for demonstration.
- See http://en.wikipedia.org/wiki/Regular_expression#Syntax for more information
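As a rough illustration of a feature that depends on the standard, the sketch below uses a Perl-style lookahead, "(?=...)", which is not covered in this talk and is chosen here only as an example. It should work with perl=TRUE; the default engine is expected to reject it, so that call is wrapped in tryCatch and its result is not relied on.

```r
x <- c("foobar", "foobaz")

# Lookahead "(?=...)" is a Perl-style feature: match "foo" only when
# "bar" follows it.
with_perl <- grepl("foo(?=bar)", x, perl = TRUE)

# The same pattern under the default engine; any error or warning is
# turned into NA so the script keeps running.
default_engine <- tryCatch(
  grepl("foo(?=bar)", x),
  error = function(e) NA,
  warning = function(w) NA
)

print(with_perl)
print(default_engine)
```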
Slide 7: The simplest regular expression
Comments
- The R function grep can be used to find matches in a character vector. It returns the indices of the matching strings.
- You may have already used regular expressions without knowing it!
- "test" is a pattern, but not a very complicated one. How do you describe more complex patterns?
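A small sketch of the "test" pattern in use. The input strings are made up for illustration; grep, its value argument, and grepl are standard base R functions.

```r
x <- c("test", "this is a test", "tset")

idx   <- grep("test", x)                # indices of matching strings
vals  <- grep("test", x, value = TRUE)  # the matching strings themselves
flags <- grepl("test", x)               # one TRUE/FALSE per input string

print(idx)    # 1 2
print(vals)   # "test" "this is a test"
print(flags)  # TRUE TRUE FALSE
```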
Slide 8: Metacharacters
- Characters that have special meaning
- Meta, meaning "about"
Comments
- Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.
Slide 9: Your first metacharacter: Dot
- Dot (or period) matches any single character
> grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2
- "foo.bar" is the regular expression
Comments
- You'll notice that dot matches a space
- What happens when you want to match two wildcard characters?
Slide 10: Matching two wildcards
> grep("foo..bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 4
Comments
- But what if you want to match three wildcard characters? There's an easier way.
Slide 11: Variable length matching, using "plus"
> grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 4
- "foo.+bar" is the regular expression
- The plus metacharacter means: "one or more times"
Comments
- You can use the plus metacharacter on other characters, too.
Slide 12: More "plus" usage
> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3
Slide 13: Quantifiers
- Metacharacters that describe "how many"
- The "plus" metacharacter is a quantifier
Comments
- There are a couple of other quantifiers you can use
Slide 14: Star metacharacter: "Zero or more times"
> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 3 4
> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4
Comments
- You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.
Slide 15: Question mark metacharacter: "Zero or one times"
> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2
Comments
- The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.
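The comparison described above can be checked directly. This sketch reuses the strings from the slide and only swaps the question mark for a star.

```r
x <- c("abdef", "abcdef", "abccdef")

optional_c <- grep("abc?def", x)  # "c" zero or one times
any_c      <- grep("abc*def", x)  # "c" zero or more times

print(optional_c)  # 1 2
print(any_c)       # 1 2 3
```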
Slide 16: Brace metacharacter: "Exactly n times"
> grep("10{6}", c("1000", "1000000"))
[1] 2
Slide 17: Brace metacharacter improved
> grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001"))
[1] 2 3 4
- {n,m}: match at least n times, but not more than m times
Slide 18: Intermission
Questions?
Slide 19: Anchors!
> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3
- The third string matches!
Comments
- The third string matches because this regular expression is "floating". It matches anywhere in the string.
- Solution: use an anchor.
Slide 20: Anchor to the end with a dollar sign
Comments
- The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
- Anchor metacharacters are special because they match positions in the string rather than characters.
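A sketch of the dollar-sign anchor, assuming the same three strings as on the previous slide. The unanchored pattern is shown alongside it for comparison.

```r
x <- c("100", "1000", "10000")

floating <- grep("10{3}", x)   # matches anywhere in the string
anchored <- grep("10{3}$", x)  # "1000" must sit at the end of the string

print(floating)  # 2 3
print(anchored)  # 2
```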
Slide 21: Anchor to the beginning with a caret
> grep("10{3}", c("1000", "abc1000"))
[1] 1 2
Comments
- The caret "anchors" the regular expression to the beginning of the string.
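For comparison, a sketch of the same strings with the caret added. With the anchor in place, only the string that starts with the pattern should match.

```r
x <- c("1000", "abc1000")

floating <- grep("10{3}", x)   # matches anywhere in the string
anchored <- grep("^10{3}", x)  # "1000" must sit at the start of the string

print(floating)  # 1 2
print(anchored)  # 1
```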
Slide 22: Using both anchors
> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3
> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2
> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2
Slide 23: Character classes with brackets
- When you want to match only certain characters instead of any character (as dot does):
> grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg"))
[1] 1 2 3 4
- Bracket metacharacter
Slide 24: Character class ranges
- Instead of typing out all of those letters:
> grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg"))
[1] 1 2 3 4
- Dash metacharacter
Slide 25: Multiple ranges
> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg"))
[1] 1 2 3 4
Slide 26: Mixing ranges and characters
> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2"))
[1] 1 2 3 4
Slide 27: Context matters
- Because regular expressions aren't hard enough
- Some metacharacters mean different things depending on context
Slide 28: Context in character classes
- Most metacharacters lose their meaning inside brackets:
> grep("[.+*]", c(".", "+", "*", "x"))
[1] 1 2 3
> grep("[a-c-]", c("a", "b", "c", "-"))
[1] 1 2 3 4
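Context also works the other way: a character that is plain or positional outside brackets can gain a new meaning inside them. The sketch below uses the caret, which negates a character class when it is the first character inside brackets, and the dash, which is a range between two characters but a literal at the end of a class. The input strings are made up for illustration.

```r
x <- c("123", "12a", "abc")

# Outside brackets, caret anchors to the start: strings starting with a digit
starts_with_digit <- grep("^[0-9]", x)

# As the first character inside brackets, caret negates the class:
# strings containing at least one non-digit
has_non_digit <- grep("[^0-9]", x)

# A dash between two characters is a range; a trailing dash is literal
dash_literal <- grepl("[a-c-]", c("-", "d"))

print(starts_with_digit)  # 1 2
print(has_non_digit)      # 2 3
print(dash_literal)       # TRUE FALSE
```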
Slide 29: Full circle
Comments
- Now, hopefully this is less scary.
- Let's go through this step by step.
Slide 30: Better e-mail regular expression
Comments
- Complete with Titans colors!
- ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
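A sketch of trying this expression in R. The addresses are hypothetical test inputs. Note that inside an R string literal a backslash must itself be escaped, so the regular expression's \. is written as \\. in the code.

```r
# The regular expression from the slide, with the backslash doubled for R
email_re <- "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}$"

# Hypothetical inputs, chosen to exercise each part of the pattern
addresses <- c(
  "jane.doe@example.com",  # well formed
  "jane.doe.example.com",  # no "@"
  "jane@example",          # no dot followed by 2 to 4 letters
  "jane@example.museum"    # final part is longer than 4 letters
)

is_email <- grepl(email_re, addresses)
print(is_email)  # TRUE FALSE FALSE FALSE
```

The last input shows a limit of the {2,4} quantifier: it rejects domains whose final part is longer than four letters.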
--
JeremyStephens - 26 Sep 2011