Regular Expression Primer

Slide 1: Regular expressions

email-regexp.png
Jeremy Stephens - Thursday, September 29, 2011

Comments

  1. Text-version: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

Slide 2: What are they?

Regular expressions are a way to describe patterns in text.

Comments

  1. The regular expression shown on the previous slide is meant to match valid e-mail addresses. By the end of the talk, what it does should be clearer.

Slide 3: When to use them

  • Finding things
  • Replacing things: c("123", "123x", "abc123")
  • Extracting data:
    "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
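
The three uses above map roughly onto base R functions as sketched here. The slide does not say which patterns were demonstrated, so the patterns and variable names below are assumptions chosen for illustration; they rely on character classes, which come later in the talk.

```r
# Finding: which strings contain a digit?
x <- c("123", "123x", "abc123")
has_digit <- grepl("[0-9]", x)
print(has_digit)     # TRUE TRUE TRUE

# Replacing: remove every character that is not a digit
digits_only <- gsub("[^0-9]", "", x)
print(digits_only)   # "123" "123" "123"

# Extracting: pull every run of digits out of a sentence
sentence <- "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
numbers <- regmatches(sentence, gregexpr("[0-9]+", sentence))[[1]]
print(numbers)       # "5" "456"
```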

Slide 4: Barriers to using regular expressions

scary-regexp.jpg
Photo: Carole Pasquier; Regular expression: Thomas Dupont

Comments

  1. They're scary! The above regular expression was written by Thomas in order to parse court case documents.
  2. Some good advice...

Slide 5: Advice

Dont_panic.jpg
GFX credit: Seddon

Comments

  1. All regular expressions are made up of the same building blocks.

Slide 6: Things you (might) need to know

There are 3 main regular expression standards:

  • POSIX Basic Regular Expressions
  • POSIX Extended Regular Expressions (R uses this by default)
  • Perl-based Regular Expressions (R has support for this as well)

Comments

  1. Different programming languages support different standards. R uses ERE by default, and it also supports Perl-based regular expressions (pass perl=TRUE).
  2. There is a caveat when using regular expressions in R, but that will appear later on in this talk.
  3. If you use regular expressions in a language and find that they don't behave as you expect, it may be because the language doesn't support certain features.
  4. Examples in this talk use R.
  5. See http://en.wikipedia.org/wiki/Regular_expression#Syntax for more information.
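
As a small illustration of the perl=TRUE switch, here is a sketch using lookahead, a feature of Perl-based regular expressions. Lookahead is not covered in this talk; it is used only as an example of something that needs perl=TRUE, and the input strings are made up.

```r
# "(?=bar)" is a lookahead: match "foo" only when "bar" comes right after it.
# This is Perl-style syntax, so perl = TRUE is required.
x <- c("foobar", "foobaz")
perl_hits <- grepl("foo(?=bar)", x, perl = TRUE)
print(perl_hits)   # TRUE FALSE
```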

Slide 7: The simplest regular expression

  • Plain text!
    > grep("test", c("foo", "bar", "test"))
    [1] 3
    
  • "test" is a perfectly valid regular expression
  • Works just like CTRL+F in your browser

browser-find.png

Comments

  1. The R function grep can be used to find matches in a character vector. It returns the indices of the matching strings.
  2. You may have already used regular expressions without knowing it!
  3. "test" is a pattern, but not a very complicated one. How do you describe more complex patterns?
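
Indices are not always the most convenient result. A short sketch of two related base R calls, using a made-up input vector:

```r
x <- c("foo", "bar", "test", "my test file")

idx   <- grep("test", x)                 # indices of matching elements
vals  <- grep("test", x, value = TRUE)   # the matching strings themselves
flags <- grepl("test", x)                # one TRUE/FALSE per element

print(idx)     # 3 4
print(vals)    # "test" "my test file"
print(flags)   # FALSE FALSE TRUE TRUE
```

grepl is handy for subsetting, since its result has the same length as the input.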

Slide 8: Metacharacters

  • Characters that have special meaning
  • Meta, meaning "about"

Comments

  1. Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.

Slide 9: Your first metacharacter: Dot

  • Dot (or period) matches any character
    > grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
    [1] 1 2
    
  • "foo.bar" is the regular expression

Comments

  1. You'll notice that dot matches a space, too.
  2. What happens when you want to match two wildcard characters?

Slide 10: Matching two wildcards

> grep("foo..bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 4

Comments

  1. But what if you want to match three wildcard characters? There's an easier way.

Slide 11: Variable length matching, using "plus"

> grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 4
  • "foo.+bar" is the regular expression
  • The plus metacharacter means: "one or more times"

Comments

  1. You can use the plus metacharacter on other characters, too.

Slide 12: More "plus" usage

> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3

Slide 13: Quantifiers

  • Metacharacters that describe "how many"
  • The "plus" metacharacter is a quantifier

Comments

  1. There are a couple of other quantifiers you can use.

Slide 14: Star metacharacter: "Zero or more times"

> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 3 4

> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4

Comments

  1. You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.

Slide 15: Question mark metacharacter: "Zero or one times"

> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2

Comments

  1. The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.

Slide 16: Brace metacharacter: "Exactly n times"

> grep("10{6}", c("1000", "1000000"))
[1] 2

Slide 17: Brace metacharacter improved

> grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001"))
[1] 2 3 4
  • {n,m} - Match at least n times, but not more than m times (written without a space after the comma)
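
Braces also accept an open-ended form, {n,}, meaning "at least n times". The sketch below shows it and checks that the shorthand quantifiers from the earlier slides can be spelled with braces. The variable names are illustrative only.

```r
x <- c("11", "101", "1001", "10001", "100001", "1000001")

# {2,} means "two or more zeros"
at_least_two <- grep("10{2,}1", x)
print(at_least_two)   # 3 4 5 6

# Each shorthand quantifier has a brace equivalent
same_as_plus  <- identical(grep("10+1", x), grep("10{1,}1", x))   # + is {1,}
same_as_star  <- identical(grep("10*1", x), grep("10{0,}1", x))   # * is {0,}
same_as_qmark <- identical(grep("10?1", x), grep("10{0,1}1", x))  # ? is {0,1}
print(c(same_as_plus, same_as_star, same_as_qmark))   # TRUE TRUE TRUE
```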

Slide 18: Intermission

Questions?

Slide 19: Anchors!

> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3
  • The third string matches!

Comments

  1. The third string matches because this regular expression is "floating". It matches anywhere in the string.
  2. Solution: use an anchor.
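
To see why the third string matches, it helps to look at the text that was actually matched rather than just the indices. A sketch using base R's regexpr and regmatches:

```r
x <- c("100", "1000", "10000")

m <- regexpr("10{3}", x)       # position of the first match in each string, -1 if none
matched <- regmatches(x, m)    # matched text, only for the strings that matched
print(matched)                 # "1000" "1000"
```

In "10000" the pattern matches the first four characters and simply ignores the trailing zero.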

Slide 20: Anchor to the end with a dollar sign

  • Dollar sign metacharacter:
    > grep("10{3}$", c("100", "1000", "10000"))
    [1] 2
    

Comments

  1. The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
  2. Anchor metacharacters are special because they match a position in the string rather than a character.
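
One way to see that anchors match positions, not characters, is to replace them: nothing in the original string is consumed. A small sketch (both anchors are shown here, although the caret is introduced on the next slide):

```r
# Replacing an anchor inserts text at that position without removing anything
start_marked <- sub("^", ">", "abc")
end_marked   <- sub("$", "<", "abc")
print(start_marked)   # ">abc"
print(end_marked)     # "abc<"
```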

Slide 21: Anchor to the beginning with a caret

> grep("10{3}", c("1000", "abc1000"))
[1] 1 2
  • Caret metacharacter:
    > grep("^10{3}", c("1000", "abc1000"))
    [1] 1
    

Comments

  1. The caret "anchors" the regular expression to the beginning of the string.

Slide 22: Using both anchors

> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3

> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2

> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2

Slide 23: Character classes with brackets

  • When dot's "any character" is too permissive, list the characters you will accept:
    > grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg")) 
    [1] 1 2 3 4
    
  • Bracket metacharacter

Slide 24: Character class ranges

  • Instead of typing out all of those letters:
    > grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg")) 
    [1] 1 2 3 4
    
  • Dash metacharacter

Slide 25: Multiple ranges

> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg")) 
[1] 1 2 3 4

Slide 26: Mixing ranges and characters

> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2")) 
[1] 1 2 3 4
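
Character classes combine naturally with quantifiers and anchors. As a sketch, here is a check for strings made up entirely of hexadecimal digits; is_hex is a hypothetical helper written for this example, not something from the talk.

```r
# One or more hex digits, and nothing else, from start to end of the string
is_hex <- function(x) grepl("^[0-9A-Fa-f]+$", x)

result <- is_hex(c("ff", "1A2b", "xyz", "12 ", ""))
print(result)   # TRUE TRUE FALSE FALSE FALSE
```

Without the anchors, "xyz1" would pass, because the pattern would float and match the "1".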

Slide 27: Context matters

  • Because regular expressions aren't hard enough
  • Some metacharacters mean different things depending on context

Slide 28: Context in character classes

  • Most metacharacters lose their meaning inside brackets:
    > grep("[.+*]", c(".", "+", "*", "x"))
    [1] 1 2 3
    > grep("[a-c-]", c("a", "b", "c", "-"))
    [1] 1 2 3 4
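
A few more context effects, sketched with made-up inputs. Two of them go slightly beyond the slide: a backslash also makes a metacharacter literal (and has to be doubled inside an R string literal), and a caret placed first inside brackets negates the class instead of anchoring.

```r
x <- c("a.c", "abc")

wild      <- grep("a.c", x)     # dot is a wildcard outside brackets
bracketed <- grep("a[.]c", x)   # dot is literal inside brackets
escaped   <- grep("a\\.c", x)   # backslash-escaped dot; "\\" in R source is one backslash
print(wild)        # 1 2
print(bracketed)   # 1
print(escaped)     # 1

# First caret: anchor to the start. Second caret: "any character except a, b or c".
not_abc <- grep("^[^abc]", c("apple", "cherry", "date"))
print(not_abc)     # 3
```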
    

Slide 29: Full circle

email-regexp.png

Comments

  1. Now, hopefully this is less scary.
  2. Let's go through this step by step.

Slide 30: Better e-mail regular expression

email-regexp-2.png

Comments

  1. Complete with Titans colors!
  2. ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
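
To try the pattern in R, the backslash before the dot has to be doubled inside the string literal. The sketch below uses made-up addresses; it also shows that the upper-case-only version from the first slide behaves the same way when matched with ignore.case = TRUE.

```r
email_pattern <- "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}$"

addresses <- c("jane.doe@example.com",     # ordinary address
               "JANE@EXAMPLE.ORG",         # upper case is accepted
               "no-at-sign.example.com",   # no "@", so no match
               "bob@example.museum")       # final part is longer than 4 letters

valid <- grepl(email_pattern, addresses)
print(valid)   # TRUE TRUE FALSE FALSE

# The pattern from the first slide, matched case-insensitively
first_pattern <- "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}$"
valid_first <- grepl(first_pattern, addresses, ignore.case = TRUE)
print(identical(valid, valid_first))   # TRUE
```

The last address is rejected only because of the {2,4} limit on the final part, which shows that the pattern is an approximation rather than a complete validity check.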

-- JeremyStephens - 26 Sep 2011
Topic revision: r7 - 18 Dec 2015, JeremyStephens
 
