Regular expressions

[Image: email-regexp.png]
Jeremy Stephens - Thursday, September 29, 2011

What are they?

Regular expressions are a way to describe patterns in text.

When to use them

  • Finding things
  • Replacing things: c("123", "123x", "abc123")
  • Extracting data:
    "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."

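The three uses above can be sketched in R roughly as follows. The slide does not say what should be replaced in the second example, so dropping the lowercase letters is an assumption made for illustration, as are the variable names.

```r
# Finding: which strings contain the text "123"?
x <- c("123", "123x", "abc123")
found <- grepl("123", x)

# Replacing: assuming the goal is to drop lowercase letters (a guess at
# what the slide intended), gsub() replaces every match with "".
cleaned <- gsub("[a-z]", "", x)

# Extracting: pull every run of digits out of the sentence.
s <- "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
numbers <- regmatches(s, gregexpr("[0-9]+", s))[[1]]

print(found)    # TRUE TRUE TRUE
print(cleaned)  # "123" "123" "123"
print(numbers)  # "5" "456"
```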
Barriers to using regular expressions

[Image: scary-regexp.jpg]
Photo: Carole Pasquier; Regular expression: Thomas Dupont

Advice

[Image: Dont_panic.jpg]
GFX credit: Seddon

Things you (might) need to know

There are three main regular expression standards:

  • POSIX Basic Regular Expressions
  • POSIX Extended Regular Expressions (R uses these by default)
  • Perl-compatible Regular Expressions (R supports these as well, via perl = TRUE)
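As a rough sketch of how the engine is chosen in R: the matching functions take a perl argument, and a fixed argument that switches regular expressions off altogether. The sample strings and variable names are made up for illustration.

```r
x <- c("id 42", "no digits here")

# Default engine: POSIX-style extended regular expressions.
default_hits <- grepl("[0-9]+", x)

# Perl-compatible engine, selected with perl = TRUE.
perl_hits <- grepl("[0-9]+", x, perl = TRUE)

# fixed = TRUE turns regular expressions off and matches the text literally,
# so this looks for the seven characters "[0-9]+" themselves.
literal_hits <- grepl("[0-9]+", x, fixed = TRUE)

print(default_hits)  # TRUE FALSE
print(perl_hits)     # TRUE FALSE
print(literal_hits)  # FALSE FALSE
```

For a simple pattern like this one, both engines agree; they differ mainly in the more advanced syntax each supports.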

The simplest regular expression

  • Plain text!
    > grep("test", c("foo", "bar", "test"))
    [1] 3
    
  • "test" is a perfectly valid regular expression
  • Works just like CTRL+F in your browser

[Image: browser-find.png]

Metacharacters

  • Characters that have special meaning
  • Meta, meaning "about"

Your first metacharacter: Dot

  • Dot (or period) matches any single character
    > grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
    [1] 1 2
    
  • "foo.bar" is the regular expression

Matching two wildcards

> grep("foo..bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 4

Variable length matching, using "plus"

> grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 4
  • "foo.+bar" is the regular expression
  • The plus metacharacter means: match the preceding item "one or more times"

More "plus" usage

> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3

Quantifiers

  • Metacharacters that describe "how many"
  • The "plus" metacharacter is a quantifier

Star metacharacter: "Zero or more times"

> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 3 4

> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4

Question mark metacharacter: "Zero or one times"

> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2

Brace metacharacter: "Exactly n times"

> grep("10{6}", c("1000", "1000000"))
[1] 2

Brace metacharacter improved

> grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001"))
[1] 2 3 4
  • {n,m} - Match at least n times, but not more than m times (no space after the comma)
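To compare the quantifiers seen so far side by side, one possible sketch runs each of them against the same strings. The anchors ^ and $ (covered in the slides that follow) are added here only so that each pattern has to describe the whole string; the names in the patterns vector are made up for illustration.

```r
x <- c("ba", "bar", "barr", "barrrr")

# Each pattern applies a different quantifier to the final "r".
patterns <- c(plus = "^bar+$", star = "^bar*$", question = "^bar?$",
              exact = "^bar{2}$", range = "^bar{2,4}$")

# One row per string, one column per pattern; TRUE means the string matches.
result <- sapply(patterns, function(p) grepl(p, x))
rownames(result) <- x
print(result)
```

Reading down a column shows how many trailing "r" characters each quantifier accepts.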

Intermission

Questions?

Anchors!

> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3
  • The third string matches too! The pattern only has to match somewhere in the string, and "1000" appears inside "10000"
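To see which part of each string the pattern matched, a small sketch using regexpr() and regmatches() from base R (the variable names are illustrative):

```r
x <- c("100", "1000", "10000")

# regexpr() reports where the first match starts in each string (-1 if none).
m <- regexpr("10{3}", x)

# regmatches() returns the matched text for the strings that matched.
matched <- regmatches(x, m)

print(as.integer(m))  # -1 1 1
print(matched)        # "1000" "1000"
```

In "10000" the match is the first four characters; the trailing zero is simply ignored.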

Anchor to the end with a dollar sign

  • Dollar sign metacharacter:
    > grep("10{3}$", c("100", "1000", "10000"))
    [1] 2
    

Anchor to the beginning with a caret

> grep("10{3}", c("1000", "abc1000"))
[1] 1 2
  • Caret metacharacter:
    > grep("^10{3}", c("1000", "abc1000"))
    [1] 1
    

Using both anchors

> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3

> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2

> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2

Character classes with brackets

  • When dot is too permissive, list exactly the characters you want to allow:
    > grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg")) 
    [1] 1 2 3 4
    
  • Bracket metacharacter

Character class ranges

  • Instead of typing out all of those letters:
    > grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg")) 
    [1] 1 2 3 4
    
  • Dash metacharacter

Multiple ranges

> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg")) 
[1] 1 2 3 4

Mixing ranges and characters

> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2")) 
[1] 1 2 3 4
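Character classes combine naturally with the quantifiers and anchors from earlier. A possible sketch, reusing the strings from the "When to use them" slide (the variable names are illustrative):

```r
x <- c("123", "123x", "abc123")

# ^ and $ anchor the pattern to the whole string; [0-9]+ is one or more digits.
all_digits <- grepl("^[0-9]+$", x)

# One or more letters (either case) followed by one or more digits, nothing else.
letters_then_digits <- grepl("^[a-zA-Z]+[0-9]+$", x)

print(all_digits)           # TRUE FALSE FALSE
print(letters_then_digits)  # FALSE FALSE TRUE
```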

Context matters

  • Because regular expressions aren't hard enough
  • Some metacharacters mean different things depending on context
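Two characters that change meaning with context are the caret and the dash. The sketch below is an illustration chosen for this point rather than an example from the slides, and the sample strings are made up.

```r
x <- c("abc", "cab", "aaa", "-")

# Outside brackets, a caret anchors the match to the start of the string.
starts_with_a <- grepl("^a", x)

# As the first character inside brackets, a caret negates the class:
# "[^a]" matches any single character other than "a".
has_non_a <- grepl("[^a]", x)

# Outside brackets, a dash is an ordinary character ...
has_dash <- grepl("-", x)

# ... while between two characters inside brackets it denotes a range.
has_a_to_c <- grepl("[a-c]", x)

print(starts_with_a)  # TRUE FALSE TRUE FALSE
print(has_non_a)      # TRUE TRUE FALSE TRUE
print(has_dash)       # FALSE FALSE FALSE TRUE
print(has_a_to_c)     # TRUE TRUE TRUE FALSE
```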

Context in character classes

  • Most metacharacters lose their meaning inside brackets:
    > grep("[.+*]", c(".", "+", "*", "x"))
    [1] 1 2 3
    > grep("[a-c-]", c("a", "b", "c", "-"))
    [1] 1 2 3 4
    

Full circle

[Image: email-regexp.png]

Better e-mail regular expression

[Image: email-regexp-2.png]
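As a rough illustration of how the pieces from this talk combine, here is a deliberately simplified e-mail-like pattern. It is an assumption made for illustration, not the expression shown in either image, and it is far too loose for real address validation. The sample addresses are made up.

```r
# Anchors, character classes, ranges, and the plus quantifier together:
#   one or more allowed characters, an "@", a domain, a literal dot
#   (written as [.] so it loses its special meaning), then letters.
email_like <- "^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$"

emails <- c("jane.doe@example.com", "user@host.org", "a@b",
            "@example.com", "not an email")

looks_like_email <- grepl(email_like, emails)
print(looks_like_email)  # TRUE TRUE FALSE FALSE FALSE
```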