Regular Expression Primer

Slide 1: Regular expressions

[Image: an e-mail regular expression; the text version is in the comments below]
Stolen from Jeremy Stephens - December 5, 2011

Comments

  1. Text-version: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

Slide 2: What are they?

Regular expressions are a way to describe patterns in text.

Comments

  1. The regular expression shown in the previous slide matches valid e-mail addresses. By the end of the talk, what it does should be clearer.

Slide 3: When to use them

  • Finding things
  • Replacing things: c("123", "123x", "abc123")
  • Extracting data:
schriro,1,02/02/09,2009,3,0,0,0,1,1, 0,0,1,0,9,3,"o'scannlain, graber, mckeown", 4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0, 0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0, 0,0,784,2,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0, 0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,lauren,nf 2006 WL 2865064,berry v. epps,0, 10/05/06,2006,24,0,0,0,1,1,0,1,0,0,0,1, davidson,4,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0, 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,1,0,0,0, 0,0,1,1,0,0,0,0,0,0,0,"*23-*24, *33-*35", 2,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0, 0,0,0,0,1,0,1,0,0,0,0,1,0,0,A.J.,nf
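
A rough Python sketch of the three uses above. The pattern syntax is explained in later slides. Two things here are assumptions: the goal of the replacement (stripping letters so only digits remain) is a guess, since the slide only lists the strings, and `record` is a shortened excerpt of the data line rather than the full record.

```python
import re

samples = ["123", "123x", "abc123"]

# Finding: which strings contain a run of digits?
has_digits = [re.search("[0-9]+", s) is not None for s in samples]

# Replacing (assumed goal): strip lowercase letters so only the digits remain.
cleaned = [re.sub("[a-z]+", "", s) for s in samples]

# Extracting: pull dd/dd/dd dates out of a shortened excerpt of the record.
record = "schriro,1,02/02/09,2009,3,0 ... berry v. epps,0, 10/05/06,2006,24"
dates = re.findall("[0-9]{2}/[0-9]{2}/[0-9]{2}", record)

print(has_digits)  # [True, True, True]
print(cleaned)     # ['123', '123', '123']
print(dates)       # ['02/02/09', '10/05/06']
```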

Slide 4: Things you (might) need to know

There are 3 main regular expression standards:

  • POSIX Basic Regular Expressions
  • POSIX Extended Regular Expressions (R uses this by default)
  • Perl-based Regular Expressions (R has support for this as well)

Comments

  1. Different programming languages support different standards. R uses ERE by default and also supports Perl-based regular expressions (pass perl=TRUE).
  2. There is a caveat when using regular expressions in R, which comes up later in this talk.
  3. If regular expressions in some language don't behave as you expect, the language may not support the feature you are using.
  4. The examples in this talk use R for demonstration.
  5. See http://en.wikipedia.org/wiki/Regular_expression#Syntax for more information.

Slide 5: The simplest regular expression

Plain text!

> grep("test", c("foo", "bar", "test"))
[1] 3

  • "test" is a perfectly valid regular expression
  • Works just like CTRL+F in your browser

In Python:

>>> import re
>>> words = ["foo", "bar", "test"]
>>> [re.search("test", word) is not None for word in words]
[False, False, True]

Comments

  1. The R function grep finds matches in a character vector. It returns the indices of the matching strings.
  2. You may have already used regular expressions without knowing it!
  3. "test" is a pattern, but not a very complicated one. How do you describe more complex patterns?
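
To mirror R's index-based output in Python, one option is a small helper. The `grep` function below is a hypothetical stand-in written for this note, not part of Python's `re` module; it returns 1-based positions so the numbers line up with the R output.

```python
import re

def grep(pattern, strings):
    """Return 1-based indices of the strings that contain a match, like R's grep."""
    return [i for i, s in enumerate(strings, start=1) if re.search(pattern, s)]

print(grep("test", ["foo", "bar", "test"]))  # [3]

# Plain text matches anywhere in the string, and matching is case-sensitive.
print(grep("test", ["contest", "Test"]))     # [1]
```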

Slide 6: Metacharacters

  • Characters that have special meaning
  • Meta, meaning "about"

Comments

  1. Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.

Slide 7: Your first metacharacter: Dot

  • Dot (or period) matches any single character
    > grep("foo.bar", c("fooxbar", "foo bar", "foobar", 
          "foo12bar"))
    [1] 1 2
    
  • "foo.bar" is the regular expression

Comments

  1. You'll notice that dot matches a space
  2. What happens when you want to match two wildcard characters?
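
The same check in Python, as a sketch. The newline behavior shown at the end is specific to Python's `re` module and may differ in other engines.

```python
import re

strings = ["fooxbar", "foo bar", "foobar", "foo12bar"]

# Keep the strings with exactly one character between "foo" and "bar".
matched = [s for s in strings if re.search("foo.bar", s)]
print(matched)  # ['fooxbar', 'foo bar']

# In Python, dot does not match a newline unless re.DOTALL is given.
without_flag = re.search("foo.bar", "foo\nbar")
with_flag = re.search("foo.bar", "foo\nbar", re.DOTALL)
print(without_flag is None)   # True
print(with_flag is not None)  # True
```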

Slide 8: Matching two wildcards

> grep("foo..bar", c("fooxbar", "foo bar", "foobar", 
       "foo12bar"))
[1] 4

Comments

  1. But what if you want to match three wildcard characters? There's an easier way.

Slide 9: Variable length matching, using "plus"

> grep("foo.+bar", c("fooxbar", "foo bar", "foobar",
        "foo12bar"))
[1] 1 2 4
  • "foo.+bar" is the regular expression
  • The plus metacharacter means: "one or more times"

Comments

  1. You can use the plus metacharacter on other characters, too.
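
To see what the plus actually consumed, here is a small Python sketch. Slicing off the literal "foo" and "bar" is just a convenience for this illustration.

```python
import re

strings = ["fooxbar", "foo bar", "foobar", "foo12bar"]

consumed = {}
for s in strings:
    m = re.search("foo.+bar", s)
    if m:
        # Drop the leading "foo" and trailing "bar" to expose what ".+" matched.
        consumed[s] = m.group()[3:-3]

print(consumed)  # {'fooxbar': 'x', 'foo bar': ' ', 'foo12bar': '12'}
```

"foobar" is missing from the result because there is nothing between "foo" and "bar" for the plus to match.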

Slide 10: More "plus" usage

> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3

Slide 11: Quantifiers

  • Metacharacters that describe "how many"
  • The "plus" metacharacter is a quantifier

Comments

  1. There are a few other quantifiers you can use.

Slide 12: Star metacharacter: "Zero or more times"

> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", 
       "foo12bar"))
[1] 1 2 3 4

> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4

Comments

  1. You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.
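
A side-by-side comparison of plus and star in Python, as a sketch. The last line shows that the quantifier takes as many characters as it can, which is how Python's `re` behaves by default.

```python
import re

strings = ["bar", "barr", "barrrr", "ba"]

plus = [s for s in strings if re.search("bar+", s)]
star = [s for s in strings if re.search("bar*", s)]
print(plus)  # ['bar', 'barr', 'barrrr']
print(star)  # ['bar', 'barr', 'barrrr', 'ba']

# The star grabs every available "r", not just the first one.
longest = re.search("bar*", "barrrr").group()
print(longest)  # barrrr
```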

Slide 13: Question mark metacharacter: "Zero or one times"

> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2

Comments

  1. The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.
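
The comparison in the comment above can be checked with a short Python sketch:

```python
import re

strings = ["abdef", "abcdef", "abccdef"]

# "c?" allows zero or one "c"; "c*" allows any number of them.
optional = [s for s in strings if re.search("abc?def", s)]
any_count = [s for s in strings if re.search("abc*def", s)]

print(optional)   # ['abdef', 'abcdef']
print(any_count)  # ['abdef', 'abcdef', 'abccdef']
```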

Slide 14: Brace metacharacter: "Exactly n times"

> grep("10{6}", c("1000", "1000000"))
[1] 2

Slide 15: Brace metacharacter improved

> grep("10{2,4}1", c("101", "1001", "10001", "100001", 
       "1000001"))
[1] 2 3 4
  • {n,m} - Match at least n times, but not more than m times
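
A Python sketch of the brace forms. The helper `matching` is hypothetical and exists only for this example. The open-ended form `{n,}` is not on the slides, and the note about spaces describes Python's `re` module specifically.

```python
import re

zeros = ["101", "1001", "10001", "100001", "1000001"]

def matching(pattern):
    """Return the strings in `zeros` that contain a match for `pattern`."""
    return [s for s in zeros if re.search(pattern, s)]

print(matching("10{2}1"))    # exactly two zeros: ['1001']
print(matching("10{2,4}1"))  # two to four zeros: ['1001', '10001', '100001']
print(matching("10{2,}1"))   # two or more zeros: everything except '101'

# In Python, a space inside the braces turns them into literal text,
# so this pattern matches none of the strings.
print(matching("10{2, 4}1"))  # []
```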

Slide 16: Anchors!

> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3
  • The third string matches!

Comments

  1. The third string matches because this regular expression is "floating". It matches anywhere in the string.
  2. Solution: use an anchor.

Slide 17: Anchor to the end with a dollar sign

  • Dollar sign metacharacter:
    > grep("10{3}$", c("100", "1000", "10000"))
    [1] 2
    

Comments

  1. The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
  2. Anchor metacharacters are special because they match a position in the string rather than a character.
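
To make the "floating" match visible, this Python sketch prints where the match starts and ends. The trailing-newline detail at the end is a quirk of Python's `re` module and is not claimed for R.

```python
import re

# Unanchored, the pattern floats: it matches the first four characters of "10000".
floating = re.search("10{3}", "10000").span()
print(floating)  # (0, 4)

# With "$", the match has to end where the string ends.
anchored_miss = re.search("10{3}$", "10000")
anchored_hit = re.search("10{3}$", "1000").span()
print(anchored_miss is None)  # True
print(anchored_hit)           # (0, 4)

# Python-specific detail: "$" also matches just before a trailing newline.
before_newline = re.search("10{3}$", "1000\n")
print(before_newline is not None)  # True
```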

Slide 18: Anchor to the beginning with a caret

> grep("10{3}", c("1000", "abc1000"))
[1] 1 2
  • Caret metacharacter:
    > grep("^10{3}", c("1000", "abc1000"))
    [1] 1
    

Comments

  1. The caret "anchors" the regular expression to the beginning of the string.
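
In Python, the caret works the same way with `re.search`. As a side note, `re.match` only tries the start of the string, so it behaves as if the pattern began with a caret. A sketch:

```python
import re

strings = ["1000", "abc1000"]

# Explicit caret with re.search.
with_caret = [s for s in strings if re.search("^10{3}", s)]

# re.match anchors at the beginning of the string on its own.
with_match = [s for s in strings if re.match("10{3}", s)]

print(with_caret)  # ['1000']
print(with_match)  # ['1000']
```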

Slide 19: Using both anchors

> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3

> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2

> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2
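
The same three steps in Python, as a sketch. For these inputs, `re.fullmatch` gives the same answer as using both anchors, because it requires the whole string to match.

```python
import re

strings = ["abc1000", "1000", "10000"]

floating = [s for s in strings if re.search("10{3}", s)]
end_only = [s for s in strings if re.search("10{3}$", s)]
both = [s for s in strings if re.search("^10{3}$", s)]
whole = [s for s in strings if re.fullmatch("10{3}", s)]

print(floating)  # ['abc1000', '1000', '10000']
print(end_only)  # ['abc1000', '1000']
print(both)      # ['1000']
print(whole)     # ['1000']
```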

Slide 20: Character classes with brackets

  • When you want to match only certain characters, rather than any character as dot does:
    > grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg")) 
    [1] 1 2 3 4
    
  • Bracket metacharacter

Slide 21: Character class ranges

  • Instead of typing out all of those letters:
    > grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg")) 
    [1] 1 2 3 4
    
  • Dash metacharacter

Slide 22: Multiple ranges

> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg")) 
[1] 1 2 3 4

Slide 23: Mixing ranges and characters

> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2")) 
[1] 1 2 3 4
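
A Python sketch tying the last few slides together. The `grep` helper is hypothetical (it imitates R's 1-based indices), and the extra strings "ab4" and "abG" are added here only to show what falls outside the class.

```python
import re

def grep(pattern, strings):
    """Return 1-based indices of the strings that contain a match, like R's grep."""
    return [i for i, s in enumerate(strings, start=1) if re.search(pattern, s)]

letters = ["abc", "abd", "abe", "abf", "abg"]

# A range is shorthand for listing every character in it.
explicit = grep("ab[cdef]", letters)
ranged = grep("ab[c-f]", letters)
print(explicit, ranged)  # [1, 2, 3, 4] [1, 2, 3, 4]

# Ranges and single characters can be mixed; "4" and "G" are not in the class.
mixed = grep("ab[c-fC-F123]", ["abc", "abF", "ab1", "ab2", "ab4", "abG"])
print(mixed)  # [1, 2, 3, 4]
```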

Slide 24: Context matters

  • Because regular expressions aren't hard enough
  • Some metacharacters mean different things depending on context

Slide 25: Context in character classes

  • Most metacharacters lose their meaning inside brackets:
    > grep("[.+*]", c(".", "+", "*", "x"))
    [1] 1 2 3
    > grep("[a-c-]", c("a", "b", "c", "-"))
    [1] 1 2 3 4
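
To see the effect of context, this Python sketch runs the same characters inside and outside brackets. The extra string "d" is added only to show that the trailing dash does not extend the range.

```python
import re

symbols = [".", "+", "*", "x"]

# Inside brackets, ".", "+", and "*" are ordinary characters.
inside = [s for s in symbols if re.search("[.+*]", s)]
print(inside)  # ['.', '+', '*']

# Outside brackets, the dot is a wildcard again and matches all four strings.
outside = [s for s in symbols if re.search(".", s)]
print(outside)  # ['.', '+', '*', 'x']

# A dash at the end of a class is a literal dash, not part of a range.
dashed = [s for s in ["a", "b", "c", "-", "d"] if re.search("[a-c-]", s)]
print(dashed)  # ['a', 'b', 'c', '-']
```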
    

Slide 26: Full circle

[Image: the e-mail regular expression from Slide 1: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$]

Comments

  1. Now, hopefully this is less scary.
  2. Let's go through this step by step.
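
One way to walk through it is to spell the pattern out piece by piece. The sketch below uses Python's `re.VERBOSE` flag so each part can carry a comment. The labels "local part", "domain", and "top-level domain" are my reading of the pattern, and the test addresses are made up.

```python
import re

EMAIL = re.compile(r"""
    ^                 # anchor to the start of the string
    [A-Z0-9._%+-]+    # local part: one or more of these characters
    @                 # a literal at sign
    [A-Z0-9.-]+       # domain: letters, digits, dots, dashes
    \.                # an escaped dot, meaning a literal period
    [A-Z]{2,4}        # top-level domain: 2 to 4 letters
    $                 # anchor to the end of the string
""", re.VERBOSE)

print(EMAIL.search("JOHN.DOE@EXAMPLE.COM") is not None)  # True
print(EMAIL.search("NOT AN ADDRESS") is None)            # True

# The classes only list uppercase letters, so a lowercase address fails.
print(EMAIL.search("john.doe@example.com") is None)      # True
```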

Slide 27: Better e-mail regular expression

[Image: the improved e-mail regular expression, shown in color; the text version is in the comments below]

Comments

  1. Complete with Titans colors!
  2. ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
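
A quick Python comparison of the two versions, as a sketch with made-up addresses. It also shows one limit of the pattern as written: `{2,4}` rejects top-level domains longer than four letters.

```python
import re

OLD = r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
NEW = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$"

address = "Jane.Doe@example.org"  # hypothetical address

# The original only accepts uppercase letters unless case is ignored.
old_plain = re.search(OLD, address)
old_ignorecase = re.search(OLD, address, re.IGNORECASE)
new_plain = re.search(NEW, address)
print(old_plain is None)           # True
print(old_ignorecase is not None)  # True
print(new_plain is not None)       # True

# A six-letter top-level domain falls outside {2,4}.
long_tld = re.search(NEW, "curator@example.museum")
print(long_tld is None)            # True
```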

-- ChrisFonnesbeck - 05 Dec 2011