Introduction

Regular expressions are a fantastic tool for filtering and even extracting information out of strings of characters such as site codes, titles, or even entire documents. They follow a custom syntax that we’ll need to learn, but it’s well worth learning.

You only need to learn a bit of that syntax to get a lot of value out of it. I often use fairly simple regular expressions, like the ones we used on the command line:

ls *.Rmd

Learning Outcomes

Students should:

  • Understand what regular expressions are and the kinds of problems they solve
  • Be familiar with common regular expression syntax: character classes, quantifiers, anchors, and groups
  • Be able to use stringr functions such as str_view_all, str_detect, str_extract, and str_replace

Lesson

Earlier this week, we used some simple regular expressions on the command line (terminal). The same types of operations we used on the command line work in R:

getwd() # Like pwd in the shell
## [1] "/Users/bryce/src/oss-lessons/regular-expressions"
dir() # Like ls in the shell
## [1] "regular-expressions.html" "regular-expressions.Rmd" 
## [3] "site_data.csv"
library(stringr)
str_view_all(dir(), ".*Rmd")
str_view_all(dir(), ".*html")

Let’s start with an example where simpler methods won’t work and see how regular expressions can get us what we need. Say we just received some data we need to analyze and we find this:

site_data <- read.csv("site_data.csv", stringsAsFactors = FALSE)
site_data
##                            x    temp_c
## 1          2000-copany bay-2  9.247435
## 2  2001-choctawhatchee bay-2 29.170777
## 3         2002-aransas bay-3 62.351057
## 4  2003-choctawhatchee bay-4 89.888624
## 5  2004-choctawhatchee bay-4 96.958163
## 6  2005-choctawhatchee bay-2 49.894849
## 7  2006-choctawhatchee bay-4 53.401312
## 8       2007-galveston bay-2 54.335877
## 9  2008-choctawhatchee bay-1  7.279786
## 10         2009-copany bay-3  8.806454
## 11 2000-choctawhatchee bay-2 21.353557
## 12         2001-copany bay-1  3.229220
## 13      2002-galveston bay-4 71.312880
## 14 2003-choctawhatchee bay-4 42.640502
## 15         2004-copany bay-1 92.070634
## 16      2005-galveston bay-1 95.573717
## 17 2006-choctawhatchee bay-3 99.215221
## 18         2007-copany bay-1 80.570198
## 19      2008-galveston bay-1 98.582018
## 20      2009-galveston bay-2 46.406132
## 21      2000-galveston bay-1 39.235548
## 22 2001-choctawhatchee bay-2 95.831200
## 23         2002-copany bay-1 49.300697
## 24         2003-copany bay-3 29.875656
## 25         2004-copany bay-1 58.682873
## 26         2005-copany bay-1 19.943774
## 27      2006-galveston bay-4 55.101811
## 28      2007-galveston bay-4 77.270760
## 29         2008-copany bay-2 25.395214
## 30      2009-galveston bay-4 51.966997
## 31 2000-choctawhatchee bay-4 83.095610
## 32         2001-copany bay-2 40.698851
## 33      2002-galveston bay-3 24.973809
## 34 2003-choctawhatchee bay-1 44.232596
## 35      2004-galveston bay-3 59.023020
## 36 2005-choctawhatchee bay-4 59.439810
## 37         2006-copany bay-3 63.713113
## 38         2007-copany bay-1 63.446845
## 39 2008-choctawhatchee bay-2 12.215281
## 40      2009-galveston bay-4 24.810948

It looks like the author of the dataset mixed the year of measurement, the site name (e.g., copany bay, galveston bay), and some sub-site code (e.g., 1, 2, 3) into a single column. If we wanted to, for example, calculate mean temperature by site, we’d need to split these up somehow into separate columns. How could we go about this? We could start with substr, which lets us slice a string by its indices:

substr(site_data$x, 1, 4)
##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [21] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [31] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
substr(site_data$x, 5, 16)
##  [1] "-copany bay-" "-choctawhatc" "-aransas bay" "-choctawhatc"
##  [5] "-choctawhatc" "-choctawhatc" "-choctawhatc" "-galveston b"
##  [9] "-choctawhatc" "-copany bay-" "-choctawhatc" "-copany bay-"
## [13] "-galveston b" "-choctawhatc" "-copany bay-" "-galveston b"
## [17] "-choctawhatc" "-copany bay-" "-galveston b" "-galveston b"
## [21] "-galveston b" "-choctawhatc" "-copany bay-" "-copany bay-"
## [25] "-copany bay-" "-copany bay-" "-galveston b" "-galveston b"
## [29] "-copany bay-" "-galveston b" "-choctawhatc" "-copany bay-"
## [33] "-galveston b" "-choctawhatc" "-galveston b" "-choctawhatc"
## [37] "-copany bay-" "-copany bay-" "-choctawhatc" "-galveston b"

But we’d quickly find that, because the number of characters in the site name varies from row to row, no fixed set of indices can extract just the site. These are the types of problems where regular expressions come in handy.

Before we start, we’re going to use the str_view_all function from the stringr package, which gives a nice display of the result of executing a regular expression against our strings. In real use, we would use another function to actually extract and work with the result.

library(stringr)
str_view_all(site_data$x, "[a-z ]+")

The expression we used above, [a-z ]+, asks for every consecutive run of one or more of the letters a-z or " " (a space) in each string. This is exactly the type of problem regular expressions were created for!
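str_view_all only highlights where the matches are. When we want the matched text itself, we can use a function like str_extract, which returns the first match in each string (here, the site names such as "copany bay"):

str_extract(site_data$x, "[a-z ]+")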

Overview of Regular Expressions

Regular expressions can match things literally, e.g.,

str_detect("grouper", "striper")
## [1] FALSE
str_detect("grouper", "grouper")
## [1] TRUE

but they also support a large set of special characters:

fish <- c("grouper", "striper", "sheepshead")
str_view_all(fish, ".p")

The . in that pattern is itself a special character: it matches any single character, so .p matches any character followed by a p. If you actually want to match a period and not any character, you have to do what’s called escaping:

fish <- c("stripers", "striper.", "grouper")
str_view_all(fish, "striper\\.")

See how that regular expression only matched the striper with the period at the end and not the string stripers?

fish <- c("grouper", "striper", "sheepshead")
str_view_all(fish, "[aeiou]")
fish <- c("grouper", "striper", "sheepshead")
str_view_all(fish, "[^aeiou]")
fish <- c("gag grouper", "striper", "red drum")
str_view_all(fish, "\\s") # Note the double \\ before the s. This is an R-specific thing.
                           # many of our special characters must be preceded by a \\
str_view_all(fish, "\\S")

Note that the lower-case version \\s matches any whitespace character, whereas the upper-case version \\S matches any non-whitespace character. The patterns \\d (any digit) and \\w (any "word" character: letters, digits, and underscore) work analogously:

fish <- c("striper1", "red drum2", "tarpon123")
str_view_all(fish, "\\d")
fish <- c("striper1", "red drum2", "tarpon123")
str_view_all(fish, "\\w")

We can also specify how many of a particular character or class of characters to match:

  • ? matches zero or one
  • + matches one or more
  • * matches zero or more
  • {n} matches exactly n, and {n,m} matches between n and m
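For example, here’s a small made-up illustration of {n,m}:

fish <- c("grouper", "striperrr")
str_view_all(fish, "r{2,3}") # only matches the run of two or more r's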

Say we want to get just the phone numbers out of this vector, but we notice that the phone numbers take on a few different formats:

phone_numbers <- c("219 733 8965", "apple", "329-293-8753 ", "123banana", "595.794.7569", "3872876718")
str_view_all(phone_numbers, "\\d\\d\\d[ \\.-]?\\d\\d\\d[ \\.-]?\\d\\d\\d\\d")

The above regular expression matches the number parts of the phone numbers, which can be separated by zero or one space ( ), period (.), or dash (-).
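As before, str_view_all only highlights the matches. To actually pull the numbers out, we could use str_extract, which returns NA for the strings with no phone number (the {3} and {4} quantifiers are just a more compact way of writing the repeated \\d):

str_extract(phone_numbers, "\\d{3}[ \\.-]?\\d{3}[ \\.-]?\\d{4}")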

We can use the + expression to match runs of one or more vowels:

fish <- c("gag grouper", "striper", "red drum", "cobia", "sheepshead")
str_view_all(fish, "[aeiuo]+")

and * matches zero or more:

numbers <- c("0.2", "123.1", "547")
str_view_all(numbers, "\\d*\\.?\\d*")

# Regular expressions are greedy by default: they match as much as they can
chars <- "abcdefghijkc"       # renamed from `letters` so we don't mask R's built-in
str_view_all(chars, "a.*c")   # Greedy: matches all the way to the last c
str_view_all(chars, "a.*?c")  # Lazy: the ? makes it stop at the first c

One of the most powerful parts of regular expressions is grouping. Grouping allows us to split up our matched expressions and do more work with them. For example, we can match the city and state in a set of addresses and split each match into its components:

addresses <- c("Santa Barbara, CA", "Seattle, WA", "New York, NY")
str_view_all(addresses, "([\\w\\s]+), (\\w+)")
str_match_all(addresses, "([\\w\\s]+), (\\w+)")
## [[1]]
##      [,1]                [,2]            [,3]
## [1,] "Santa Barbara, CA" "Santa Barbara" "CA"
## 
## [[2]]
##      [,1]          [,2]      [,3]
## [1,] "Seattle, WA" "Seattle" "WA"
## 
## [[3]]
##      [,1]           [,2]       [,3]
## [1,] "New York, NY" "New York" "NY"

Once we use groups, (), we can also use back references to work with the result. A back reference is a backslash plus a number: \1 refers to the first group, \2 to the second, and so on (in R strings we write them as "\\1" and "\\2"):

str_replace(addresses, "([\\w\\s]+), (\\w+)", "City: \\1, State: \\2")
## [1] "City: Santa Barbara, State: CA" "City: Seattle, State: WA"      
## [3] "City: New York, State: NY"

It can also be really useful to be able to say something like “strings that start with a capital letter” or “strings that end with a period”:

possible_sentences <- c(
  "This might be a sentence.",
  "So. Might. this",
  "but this could maybe not be?",
  "Am I a sentence?",
  "maybe not",
  "Regular expressions are useful!"
)
# ^ anchors the match at the start, so ^[A-Z] means "starts with a capital letter"
str_detect(possible_sentences, "^[A-Z]")
## [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
possible_sentences[str_detect(possible_sentences, "^[A-Z]")]
## [1] "This might be a sentence."       "So. Might. this"                
## [3] "Am I a sentence?"                "Regular expressions are useful!"
# $ anchors the match at the end, so \\.$ means "ends with a period"
str_detect(possible_sentences, "\\.$")
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE
possible_sentences[str_detect(possible_sentences, "\\.$")]
## [1] "This might be a sentence."
# We can put them together:
str_detect(possible_sentences, "^[A-Z].*[\\.\\?!]$")
## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE
possible_sentences[str_detect(possible_sentences, "^[A-Z].*[\\.\\?!]$")]
## [1] "This might be a sentence."       "Am I a sentence?"               
## [3] "Regular expressions are useful!"

Finish out our example together

Now that we’ve gone over some basics of regular expressions, let’s finish our example by splitting the various components of column x into year, site, and plot columns:

site_data
##                            x    temp_c
## 1          2000-copany bay-2  9.247435
## 2  2001-choctawhatchee bay-2 29.170777
## 3         2002-aransas bay-3 62.351057
## 4  2003-choctawhatchee bay-4 89.888624
## 5  2004-choctawhatchee bay-4 96.958163
## 6  2005-choctawhatchee bay-2 49.894849
## 7  2006-choctawhatchee bay-4 53.401312
## 8       2007-galveston bay-2 54.335877
## 9  2008-choctawhatchee bay-1  7.279786
## 10         2009-copany bay-3  8.806454
## 11 2000-choctawhatchee bay-2 21.353557
## 12         2001-copany bay-1  3.229220
## 13      2002-galveston bay-4 71.312880
## 14 2003-choctawhatchee bay-4 42.640502
## 15         2004-copany bay-1 92.070634
## 16      2005-galveston bay-1 95.573717
## 17 2006-choctawhatchee bay-3 99.215221
## 18         2007-copany bay-1 80.570198
## 19      2008-galveston bay-1 98.582018
## 20      2009-galveston bay-2 46.406132
## 21      2000-galveston bay-1 39.235548
## 22 2001-choctawhatchee bay-2 95.831200
## 23         2002-copany bay-1 49.300697
## 24         2003-copany bay-3 29.875656
## 25         2004-copany bay-1 58.682873
## 26         2005-copany bay-1 19.943774
## 27      2006-galveston bay-4 55.101811
## 28      2007-galveston bay-4 77.270760
## 29         2008-copany bay-2 25.395214
## 30      2009-galveston bay-4 51.966997
## 31 2000-choctawhatchee bay-4 83.095610
## 32         2001-copany bay-2 40.698851
## 33      2002-galveston bay-3 24.973809
## 34 2003-choctawhatchee bay-1 44.232596
## 35      2004-galveston bay-3 59.023020
## 36 2005-choctawhatchee bay-4 59.439810
## 37         2006-copany bay-3 63.713113
## 38         2007-copany bay-1 63.446845
## 39 2008-choctawhatchee bay-2 12.215281
## 40      2009-galveston bay-4 24.810948
# I'll show you how to extract the year part
site_data$year <- str_extract(site_data$x, "\\d{4}")

# You do the rest
site_data$site <- str_extract(site_data$x, "") # <- Fill this in between the ""
site_data$plot <- str_extract(site_data$x, "") # <- Fill this in between the ""
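If you get stuck, here is one possible pair of answers; many other patterns work, so try your own first:

# The site name is the first run of lowercase letters and spaces,
# and the plot number is the digit at the very end of the string
site_data$site <- str_extract(site_data$x, "[a-z ]+")
site_data$plot <- str_extract(site_data$x, "\\d$")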

Common R functions that use regular expressions

Many base R functions accept regular expressions, alongside the stringr functions we’ve been using:

  • grepl and grep test for or locate matches
  • sub and gsub replace the first or all matches
  • regexpr, gregexpr, and regmatches locate and extract matches
  • strsplit splits strings on a pattern
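As a quick sketch of a few of the base R versions, using the site_data and addresses objects from earlier:

grepl("galveston", site_data$x)   # TRUE/FALSE for each element, like str_detect()
sub("bay", "Bay", site_data$x[1]) # replace the first match, like str_replace()
regmatches(site_data$x[1], regexpr("\\d{4}", site_data$x[1])) # extract, like str_extract()
strsplit(addresses, ", ")         # split on a pattern, like str_split()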

Another example

Data often come to us in strange forms and, before we can even begin analyzing them, we have to do a lot of work to sanitize what we’ve been given. One example, which I ran into just the other week, was temporal data with dates formatted like this:

dates <- c("1July17",
           "02July2017",
           "3July17",
           "4July17")

and so on. Do you see how the day of the month and the year are represented in different ways through the series? If we want to convert these strings into Date objects for further analysis, we’ll need to do some pre-cleaning first. Regular expressions work great here.

str_match_all(dates, "(\\d{1,2})([A-Za-z]+)(\\d{2,4})")
## [[1]]
##      [,1]      [,2] [,3]   [,4]
## [1,] "1July17" "1"  "July" "17"
## 
## [[2]]
##      [,1]         [,2] [,3]   [,4]  
## [1,] "02July2017" "02" "July" "2017"
## 
## [[3]]
##      [,1]      [,2] [,3]   [,4]
## [1,] "3July17" "3"  "July" "17"
## 
## [[4]]
##      [,1]      [,2] [,3]   [,4]
## [1,] "4July17" "4"  "July" "17"

The regular expression above is complex. Let’s break it down into its main parts. Below, I’ve re-formatted the data and the regular expression a bit so we can see what’s going on.

| Day           | Month         | Year     |
|---------------|---------------|----------|
| 1             | July          | 17       |
| 02            | July          | 2017     |
| 3             | July          | 17       |
| 4             | July          | 17       |
| \\d{1,2}      | [A-Za-z]+     | \\d{2,4} |
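To close the loop on the Date conversion we set out to do, here’s a sketch of one way to finish the job with the captured groups. It assumes every two-digit year means 20xx and that the months are full English month names:

parts <- str_match(dates, "(\\d{1,2})([A-Za-z]+)(\\d{2,4})")
years <- ifelse(nchar(parts[, 4]) == 2, paste0("20", parts[, 4]), parts[, 4])
as.Date(paste(parts[, 2], parts[, 3], years), format = "%d %B %Y")

which yields the four July 2017 dates as proper Date objects.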

Summary

  • Regular expressions are a crucial tool in the data analysis toolbox
  • Regular expressions help us solve problems we may not be otherwise able to solve
  • Regular expressions are supported in many functions in R

More

  • Have the group figure out that you can put * after a [] character class (see the sketch below)
  • Have the group try str_split on a fixed character and on a regex
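A possible starting point for both prompts:

# Quantifiers apply to a whole character class: [a-c]* matches
# zero or more of the letters a, b, or c
str_view_all("abc123", "[a-c]*")

# str_split on a fixed character vs. on a regular expression
str_split("2000-copany bay-2", "-") # split on a literal dash
str_split("a1b22c333d", "\\d+")     # split wherever one or more digits appear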

Appendices

Here’s the code I used to generate the fake site_data data.frame above.
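The original code chunk wasn’t included in this rendering, so the following is a hypothetical reconstruction that reproduces the structure of site_data; the site names, plot numbers, and temperatures are random, so the exact values depend on the seed:

# Hypothetical reconstruction: recycle the years 2000-2009 over 40 rows and
# draw random sites, plot numbers, and temperatures
set.seed(42) # the original seed is unknown
site_data <- data.frame(
  x = paste(
    2000:2009,
    sample(c("copany bay", "choctawhatchee bay",
             "galveston bay", "aransas bay"), 40, replace = TRUE),
    sample(1:4, 40, replace = TRUE),
    sep = "-"
  ),
  temp_c = runif(40, 0, 100),
  stringsAsFactors = FALSE
)
write.csv(site_data, "site_data.csv", row.names = FALSE)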