# Regular Expression Primer

## Slide 1: Regular expressions

Jeremy Stephens - Thursday, September 29, 2011

1. Text version: `^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$`

## Slide 2: What are they?

Regular expressions are a way to describe patterns in text.

1. The regular expression on the previous slide matches valid e-mail addresses. Hopefully, by the end of the talk, how it works will be clearer.

## Slide 3: When to use them

• Finding things
• Replacing things: `c("123", "123x", "abc123")`
• Extracting data:
`"Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."`
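The three uses above can be sketched in R with the strings from this slide. This is only an illustration: the patterns are assumptions chosen for the example, and they use syntax (brackets, anchors, quantifiers) that is explained on later slides.

```r
x <- c("123", "123x", "abc123")

# Finding: which elements consist only of digits?
only_digits <- grepl("^[0-9]+$", x)
# only_digits is TRUE FALSE FALSE

# Replacing: remove every character that is not a digit.
digits_only <- gsub("[^0-9]", "", x)
# digits_only is "123" "123" "123"

# Extracting: pull the numbers out of a sentence.
s <- "Tom Brady passed for 5 touchdowns and 456 yards, but they lost anyway."
numbers <- as.numeric(regmatches(s, gregexpr("[0-9]+", s))[[1]])
# numbers is 5 456
```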

## Slide 4: Barriers to using regular expressions

Photo: Carole Pasquier; Regular expression: Thomas Dupont

1. They're scary! The above regular expression was written by Thomas in order to parse court case documents.

GFX credit: Seddon

1. All regular expressions are made up of the same building blocks.

## Slide 6: Things you (might) need to know

There are 3 main regular expression standards:

• POSIX Basic Regular Expressions
• POSIX Extended Regular Expressions (R uses this by default)
• Perl-based Regular Expressions (R has support for this as well)

1. Different programming languages support different standards. R uses ERE by default, but it also supports Perl-based regular expressions when you pass `perl=TRUE`.
2. There is a caveat when using regular expressions in R, but that will appear later on in this talk.
3. If you use regular expressions in a language and find that they don't behave as you expect, it may be because that language doesn't support certain features.
4. Examples in this talk will be using R for demonstration.
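As a rough illustration of the third point, the sketch below uses a lookahead, `(?=...)`, which is a Perl-style feature that this talk does not cover. It is chosen here only as an assumed example of syntax that one engine accepts and another may not. With `perl = TRUE` the pattern works; under R's default engine the same pattern is expected to be rejected, so that call is wrapped to capture the error message rather than stop.

```r
x <- c("foobar", "foobaz")

# Perl-style lookahead: match "foo" only when "bar" follows it.
perl_result <- grepl("foo(?=bar)", x, perl = TRUE)
# perl_result is TRUE FALSE

# Same pattern with the default engine; capture the error text, if any.
default_result <- suppressWarnings(tryCatch(
  grepl("foo(?=bar)", x),
  error = function(e) conditionMessage(e)
))
```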

## Slide 7: The simplest regular expression

• Plain text!
```
> grep("test", c("foo", "bar", "test"))
[1] 3
```
• "test" is a perfectly valid regular expression
• Works just like CTRL+F in your browser

1. The R function `grep` finds matches in a character vector. It returns the indices of the matching strings.
3. "test" is a pattern, but not a very complicated one. How do you describe more complex patterns?
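Besides indices, base R can also return the matching strings themselves or a logical vector. A small sketch using the same input as the slide:

```r
x <- c("foo", "bar", "test")

idx    <- grep("test", x)                # indices of matching elements: 3
values <- grep("test", x, value = TRUE)  # the matching elements: "test"
flags  <- grepl("test", x)               # one logical per element: FALSE FALSE TRUE
```

`grepl` is often the most convenient form when you want to subset a vector or filter rows of a data frame.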

## Slide 8: Metacharacters

• Characters that have special meaning

1. Characters with special meaning in regular expressions are called "metacharacters". We add "meta", which means "about", because we're describing information about other characters.

## Slide 9: Your first metacharacter: Dot

• Dot (or period) matches any character
```
> grep("foo.bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2
```
• "foo.bar" is the regular expression

1. You'll notice that dot also matches a space.
2. What happens when you want to match two wildcard characters?

## Slide 10: Matching two wildcards

```
> grep("foo..bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 4
```

1. But what if you want to match three wildcard characters? There's an easier way.

## Slide 11: Variable length matching, using "plus"

```
> grep("foo.+bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 4
```
• "foo.+bar" is the regular expression
• The plus metacharacter means: "one or more times"

1. You can use the plus metacharacter on other characters, too.

## Slide 12: More "plus" usage

```
> grep("bar+", c("bar", "barr", "barrrrrrrr", "ba"))
[1] 1 2 3
```
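To see which part of each string the pattern actually matched, `regexpr` and `regmatches` can be combined. The sketch below reuses the input from this slide. Note that the plus applies only to the character immediately before it (the `r`), not to the whole of `bar`.

```r
x <- c("bar", "barr", "barrrrrrrr", "ba")

m <- regexpr("bar+", x)      # position of the first match in each string, -1 if none
matched <- regmatches(x, m)  # the matched text; non-matching elements are dropped
# matched is "bar" "barr" "barrrrrrrr"
```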

## Slide 13: Quantifiers

• Metacharacters that describe "how many"
• The "plus" metacharacter is a quantifier

1. There are a few other quantifiers you can use.

## Slide 14: Star metacharacter: "Zero or more times"

```
> grep("foo.*bar", c("fooxbar", "foo bar", "foobar", "foo12bar"))
[1] 1 2 3 4
```

```
> grep("bar*", c("bar", "barr", "barrrr", "ba"))
[1] 1 2 3 4
```

1. You'll notice that both "foobar" and "ba" match now, because we're matching zero or more times instead of one or more times.

## Slide 15: Question mark metacharacter: "Zero or one times"

```
> grep("abc?def", c("abdef", "abcdef", "abccdef"))
[1] 1 2
```

1. The first two strings match, but the third one doesn't. If the question mark had been a star, all three strings would match.

## Slide 16: Brace metacharacter: "Exactly n times"

```
> grep("10{6}", c("1000", "1000000"))
[1] 2
```

## Slide 17: Brace metacharacter improved

```
> grep("10{2,4}1", c("101", "1001", "10001", "100001", "1000001"))
[1] 2 3 4
```
• `{n,m}` - Match at least n times, but not more than m times (no space after the comma)
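The quantifiers covered so far can be compared side by side. The sketch below applies each one to the letter `b` in a set of made-up test strings; the strings and patterns are assumptions chosen for illustration, not taken from the slides.

```r
x <- c("ac", "abc", "abbc", "abbbc")
patterns <- c("ab*c", "ab+c", "ab?c", "ab{2}c", "ab{2,3}c")

# One column per pattern, one row per test string.
result <- sapply(patterns, grepl, x = x)
rownames(result) <- x
result
# ab*c      matches all four strings (zero or more b's)
# ab+c      matches all but "ac"     (one or more b's)
# ab?c      matches "ac" and "abc"   (zero or one b)
# ab{2}c    matches only "abbc"      (exactly two b's)
# ab{2,3}c  matches "abbc", "abbbc"  (two to three b's)
```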

Questions?

## Slide 19: Anchors!

```
> grep("10{3}", c("100", "1000", "10000"))
[1] 2 3
```
• The third string matches!

1. The third string matches because this regular expression is "floating". It matches anywhere in the string.
2. Solution: use an anchor.

## Slide 20: Anchor to the end with a dollar sign

• Dollar sign metacharacter:
```
> grep("10{3}$", c("100", "1000", "10000"))
[1] 2
```

1. The dollar sign "anchors" the regular expression to the end of the string. Now, the third string doesn't match.
2. Anchor metacharacters are special because they match a position in the string rather than a character.
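One way to see that an anchor matches a position and not a character is to replace it with `sub`. In this small sketch (the input strings are made up for illustration), replacing `$` inserts text at the end without removing anything, whereas replacing `0$` consumes the final zero.

```r
x <- c("100", "1000")

appended <- sub("$", "!", x)   # nothing is consumed; "!" lands at the end position
# appended is "100!" "1000!"

replaced <- sub("0$", "!", x)  # the final "0" is consumed and replaced
# replaced is "10!" "100!"
```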

## Slide 21: Anchor to the beginning with a caret

```
> grep("10{3}", c("1000", "abc1000"))
[1] 1 2
```
• Caret metacharacter:
```
> grep("^10{3}", c("1000", "abc1000"))
[1] 1
```

1. The caret "anchors" the regular expression to the beginning of the string.

## Slide 22: Using both anchors

```
> grep("10{3}", c("abc1000", "1000", "10000"))
[1] 1 2 3
```

```
> grep("10{3}$", c("abc1000", "1000", "10000"))
[1] 1 2
```

```
> grep("^10{3}$", c("abc1000", "1000", "10000"))
[1] 2
```
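With both anchors in place, the pattern has to account for the whole string, which makes it useful for validating or filtering values. A short sketch with the same input, using `value = TRUE` to return the strings instead of their indices:

```r
x <- c("abc1000", "1000", "10000")

floating <- grep("10{3}", x, value = TRUE)    # matches anywhere in the string
# floating is "abc1000" "1000" "10000"

exact <- grep("^10{3}$", x, value = TRUE)     # must match the entire string
# exact is "1000"
```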

## Slide 23: Character classes with brackets

• In case you don't want to match any character by using dot:
```
> grep("ab[cdef]", c("abc", "abd", "abe", "abf", "abg"))
[1] 1 2 3 4
```
• Bracket metacharacter

## Slide 24: Character class ranges

• Instead of typing out all of those letters:
```
> grep("ab[c-f]", c("abc", "abd", "abe", "abf", "abg"))
[1] 1 2 3 4
```
• Dash metacharacter

## Slide 25: Multiple ranges

```
> grep("ab[c-fC-F]", c("abc", "abC", "abf", "abF", "abg"))
[1] 1 2 3 4
```

## Slide 26: Mixing ranges and characters

```
> grep("ab[c-fC-F123]", c("abc", "abF", "ab1", "ab2"))
[1] 1 2 3 4
```
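Character classes combine naturally with the quantifiers and anchors from earlier slides. As an illustration (the task and the test strings are assumptions, not part of the talk), this sketch checks whether each string consists entirely of hexadecimal digits:

```r
x <- c("ff00", "BEEF", "xyz", "12g4")

# Three ranges in one class: digits, lowercase a-f, uppercase A-F.
# The plus requires at least one such character; the anchors require
# that nothing else appears in the string.
is_hex <- grepl("^[0-9a-fA-F]+$", x)
# is_hex is TRUE TRUE FALSE FALSE
```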

## Slide 27: Context matters

• Because regular expressions aren't hard enough
• Some metacharacters mean different things depending on context

## Slide 28: Context in character classes

• Most metacharacters lose their meaning inside brackets:
```
> grep("[.+*]", c(".", "+", "*", "x"))
[1] 1 2 3
> grep("[a-c-]", c("a", "b", "c", "-"))
[1] 1 2 3 4
```
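The effect of context is easiest to see by using the same metacharacter inside and outside brackets. The sketch below does this for dot, and also for caret. The caret case is not shown on the slide and is added here as a further example: outside brackets it is an anchor, while as the first character inside brackets it negates the class.

```r
x <- c(".", "+", "x")
dot_outside <- grepl(".", x)    # any character:     TRUE TRUE TRUE
dot_inside  <- grepl("[.]", x)  # a literal dot:     TRUE FALSE FALSE

y <- c("abc", "bca", "aaa")
caret_outside <- grepl("^a", y)    # starts with "a":             TRUE FALSE TRUE
caret_inside  <- grepl("[^a]", y)  # contains a non-"a" character: TRUE TRUE FALSE
```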

## Slide 29: Full circle

1. Now, hopefully this is less scary.
2. Let's go through this step by step.
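One possible way to walk through the e-mail pattern from Slide 1 is to assemble it from its pieces in R. The variable names and test addresses below are made up for illustration. Note that inside an R string literal the backslash has to be doubled, so the regular expression `\.` is written `"\\."`.

```r
# Pieces of the e-mail pattern from Slide 1, in order of appearance.
pieces <- c(
  "^",               # anchor: start of the string
  "[A-Z0-9._%+-]+",  # one or more letters, digits, or . _ % + -
  "@",               # a literal at sign
  "[A-Z0-9.-]+",     # one or more letters, digits, dots, or dashes
  "\\.",             # a literal dot (backslash doubled inside an R string)
  "[A-Z]{2,4}",      # two to four letters
  "$"                # anchor: end of the string
)
email_re <- paste(pieces, collapse = "")

checked <- grepl(email_re, c("JEREMY@EXAMPLE.COM", "JEREMY@EXAMPLE", "NOT AN ADDRESS"))
# checked is TRUE FALSE FALSE
```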

## Slide 30: Better e-mail regular expression

1. Complete with Titans colors!
2. `^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$`
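The difference between the two versions shows up as soon as an address contains lowercase letters. The sketch below compares them on a few made-up addresses, and also shows `ignore.case = TRUE` as an alternative way to get the same effect from the original pattern.

```r
old_re    <- "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}$"
better_re <- "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}$"

addresses <- c("JEREMY@EXAMPLE.COM", "jeremy@example.com", "jeremy@example")

old_result    <- grepl(old_re, addresses)     # TRUE FALSE FALSE
better_result <- grepl(better_re, addresses)  # TRUE TRUE FALSE

# Alternative: keep the original pattern and ignore case when matching.
nocase_result <- grepl(old_re, addresses, ignore.case = TRUE)  # TRUE TRUE FALSE
```

Both versions still reject top-level domains longer than four letters because of the `{2,4}` quantifier.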

-- JeremyStephens - 26 Sep 2011
Topic revision: r7 - 18 Dec 2015, JeremyStephens
