Regular expressions are a fantastic tool for filtering and even extracting information out of strings of characters such as site codes, titles, or even entire documents. They follow a custom syntax that we’ll need to learn, but they’re worth learning: you only need to know a little to get a lot of value out of them. I often use fairly simple regular expressions, like the ones we used on the command line:
ls *.Rmd
By the end of this lesson, students should be able to write basic regular expressions and use them in R to filter, extract, and clean string data.
Earlier this week, we used some simple regular expressions on the command line (terminal). The same types of operations work in R:
getwd() # Like pwd()
## [1] "/Users/bryce/src/oss-lessons/regular-expressions"
dir() # Like ls
## [1] "regular-expressions.html" "regular-expressions.Rmd"
## [3] "site_data.csv"
library(stringr)
str_view_all(dir(), ".*Rmd")
str_view_all(dir(), ".*html")
Let’s start off with an example where simpler methods won’t work and see how regular expressions can get us what we need. Say we just received some data we need to analyze and we find this:
site_data <- read.csv("site_data.csv", stringsAsFactors = FALSE)
site_data
## x temp_c
## 1 2000-copany bay-2 9.247435
## 2 2001-choctawhatchee bay-2 29.170777
## 3 2002-aransas bay-3 62.351057
## 4 2003-choctawhatchee bay-4 89.888624
## 5 2004-choctawhatchee bay-4 96.958163
## 6 2005-choctawhatchee bay-2 49.894849
## 7 2006-choctawhatchee bay-4 53.401312
## 8 2007-galveston bay-2 54.335877
## 9 2008-choctawhatchee bay-1 7.279786
## 10 2009-copany bay-3 8.806454
## 11 2000-choctawhatchee bay-2 21.353557
## 12 2001-copany bay-1 3.229220
## 13 2002-galveston bay-4 71.312880
## 14 2003-choctawhatchee bay-4 42.640502
## 15 2004-copany bay-1 92.070634
## 16 2005-galveston bay-1 95.573717
## 17 2006-choctawhatchee bay-3 99.215221
## 18 2007-copany bay-1 80.570198
## 19 2008-galveston bay-1 98.582018
## 20 2009-galveston bay-2 46.406132
## 21 2000-galveston bay-1 39.235548
## 22 2001-choctawhatchee bay-2 95.831200
## 23 2002-copany bay-1 49.300697
## 24 2003-copany bay-3 29.875656
## 25 2004-copany bay-1 58.682873
## 26 2005-copany bay-1 19.943774
## 27 2006-galveston bay-4 55.101811
## 28 2007-galveston bay-4 77.270760
## 29 2008-copany bay-2 25.395214
## 30 2009-galveston bay-4 51.966997
## 31 2000-choctawhatchee bay-4 83.095610
## 32 2001-copany bay-2 40.698851
## 33 2002-galveston bay-3 24.973809
## 34 2003-choctawhatchee bay-1 44.232596
## 35 2004-galveston bay-3 59.023020
## 36 2005-choctawhatchee bay-4 59.439810
## 37 2006-copany bay-3 63.713113
## 38 2007-copany bay-1 63.446845
## 39 2008-choctawhatchee bay-2 12.215281
## 40 2009-galveston bay-4 24.810948
It looks like the author of the dataset mixed the year of measurement, the site name (e.g., copany bay, galveston bay, etc.), and a sub-site code (e.g., 1, 2, 3, etc.) into a single column. If we wanted to, for example, calculate mean temperature by site, we’d need to split these up somehow into separate columns. How could we go about this? We could start with substr, which lets us slice a string by its indices:
substr(site_data$x, 1, 4)
## [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [21] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [31] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
substr(site_data$x, 5, 16)
## [1] "-copany bay-" "-choctawhatc" "-aransas bay" "-choctawhatc"
## [5] "-choctawhatc" "-choctawhatc" "-choctawhatc" "-galveston b"
## [9] "-choctawhatc" "-copany bay-" "-choctawhatc" "-copany bay-"
## [13] "-galveston b" "-choctawhatc" "-copany bay-" "-galveston b"
## [17] "-choctawhatc" "-copany bay-" "-galveston b" "-galveston b"
## [21] "-galveston b" "-choctawhatc" "-copany bay-" "-copany bay-"
## [25] "-copany bay-" "-copany bay-" "-galveston b" "-galveston b"
## [29] "-copany bay-" "-galveston b" "-choctawhatc" "-copany bay-"
## [33] "-galveston b" "-choctawhatc" "-galveston b" "-choctawhatc"
## [37] "-copany bay-" "-copany bay-" "-choctawhatc" "-galveston b"
But we’d quickly find that, because the number of characters in the site name varies, we can’t extract just the site with fixed indices. These are the types of problems where regular expressions come in handy.
Before we start, we’re going to use the str_view_all function from the stringr package, which shows a nice display of the result of executing a regular expression against our strings. In real use, we would use another function, such as str_extract or str_match, to actually get and work with the result.
library(stringr)
str_view_all(site_data$x, "[a-z ]+")
The expression we used above, [a-z ]+, asks for consecutive runs of the letters a-z or " " (a space) in each string of characters. This is the type of problem regular expressions were created for!
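To actually pull those matches out rather than just view them, we could use str_extract_all (shown here on just the first few values to keep the output short):

str_extract_all(site_data$x[1:3], "[a-z ]+")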
Regular expressions can match things literally, e.g.,
str_detect("grouper", "striper")
## [1] FALSE
str_detect("grouper", "grouper")
## [1] TRUE
but they also support a large set of special characters:
`.`: Match any character

fish <- c("grouper", "striper", "sheepshead")
str_view_all(fish, ".p")
If you actually want to match a period and not any character, you have to do what’s called escaping:
fish <- c("stripers", "striper.", "grouper")
str_view_all(fish, "striper\\.")
See how that regular expression only matched the striper with the period at the end and not the string stripers?
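If all the escaping gets confusing, note that stringr also provides fixed, which treats the entire pattern as a literal string; a quick sketch:

str_view_all(fish, fixed("striper."))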
`[]`: Match any character in this set

fish <- c("grouper", "striper", "sheepshead")
str_view_all(fish, "[aeiou]")
`[^]`: Match any character not in this set

fish <- c("grouper", "striper", "sheepshead")
str_view_all(fish, "[^aeiou]")
`\s` & `\S`: Match any whitespace character (e.g., a space or a tab)

fish <- c("gag grouper", "striper", "red drum")
str_view_all(fish, "\\s") # Note the double \\ before the s. This is an R-specific thing.
# many of our special characters must be preceded by a \\
str_view_all(fish, "\\S")
Note that the lower case version `\s` selects any whitespace characters, whereas the uppercase version `\S` selects all non-whitespace characters. The next pattern is analogous for digits:
`\d` & `\D`: Match any digit / any non-digit; `\d` is equivalent to [0-9]
fish <- c("striper1", "red drum2", "tarpon123")
str_view_all(fish, "\\d")
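And the uppercase `\D` matches everything that is not a digit:

str_view_all(fish, "\\D")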
`\w` & `\W`: Match any word character / any non-word character; `\w` is equivalent to [A-Za-z0-9_]
fish <- c("striper1", "red drum2", "tarpon123")
str_view_all(fish, "\\w")
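Here `\W` matches any non-word character, which in this vector is just the space in "red drum2":

str_view_all(fish, "\\W")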
We can also specify how many of a particular character or class of character to match:
`?`: Optional, i.e., match 0 or 1

Say we want to get just the phone numbers out of this vector, but we notice that the phone numbers take on some different formats:
phone_numbers <- c("219 733 8965", "apple", "329-293-8753 ", "123banana", "595.794.7569", "3872876718")
str_view_all(phone_numbers, "\\d\\d\\d[ \\.-]?\\d\\d\\d[ \\.-]?\\d\\d\\d\\d")
The above regular expression matches the number parts of the phone numbers, where each group of digits can be separated by zero or one of a space, `.`, or `-`.
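To actually extract the matches, we could use str_extract with the same pattern; elements with no phone number come back as NA:

str_extract(phone_numbers, "\\d\\d\\d[ \\.-]?\\d\\d\\d[ \\.-]?\\d\\d\\d\\d")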
`+`: Match 1 or more

We can use the + expression to find words with one or more vowels:
fish <- c("gag grouper", "striper", "red drum", "cobia", "sheepshead")
str_view_all(fish, "[aeiuo]+")
`*`: Match 0 or more

And the * expression matches zero or more of the preceding character:
numbers <- c("0.2", "123.1", "547")
str_view_all(numbers, "\\d*\\.?\\d*")
# Quantifiers are greedy by default: they match as much text as possible
letters <- "abcdefghijkc"
str_view_all(letters, "a.*c") # Greedy: runs all the way to the last c
str_view_all(letters, "a.*?c") # Lazy (note the ?): stops at the first c
`()`: Grouping

One of the most powerful parts of regular expressions is grouping. Grouping allows us to split up our matched expressions and do more work with them. For example, we can match the city and state in a set of addresses, splitting them into components:
addresses <- c("Santa Barbara, CA", "Seattle, WA", "New York, NY")
str_view_all(addresses, "([\\w\\s]+), (\\w+)")
str_match_all(addresses, "([\\w\\s]+), (\\w+)")
## [[1]]
## [,1] [,2] [,3]
## [1,] "Santa Barbara, CA" "Santa Barbara" "CA"
##
## [[2]]
## [,1] [,2] [,3]
## [1,] "Seattle, WA" "Seattle" "WA"
##
## [[3]]
## [,1] [,2] [,3]
## [1,] "New York, NY" "New York" "NY"
Once we use groups, (), we can also use back references to work with the result. Back references are \ and a number, where \1 is the first thing in (), \2 is the second thing in (), and so on.
str_replace(addresses, "([\\w\\s]+), (\\w+)", "City: \\1, State: \\2")
## [1] "City: Santa Barbara, State: CA" "City: Seattle, State: WA"
## [3] "City: New York, State: NY"
`^` & `$`: Match the start and end of a string

It can also be really useful to say something like “strings that start with a capital letter” or “strings that end with a period”:
possible_sentences <- c(
"This might be a sentence.",
"So. Might. this",
"but this could maybe not be?",
"Am I a sentence?",
"maybe not",
"Regular expressions are useful!"
)
# ^ specifies the start, so ^[A-Z] means "starts with a capital letter"
str_detect(possible_sentences, "^[A-Z]")
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
possible_sentences[str_detect(possible_sentences, "^[A-Z]")]
## [1] "This might be a sentence." "So. Might. this"
## [3] "Am I a sentence?" "Regular expressions are useful!"
# We can also do "ends with a period"
str_detect(possible_sentences, "\\.$")
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
possible_sentences[str_detect(possible_sentences, "\\.$")]
## [1] "This might be a sentence."
# We can put them together:
str_detect(possible_sentences, "^[A-Z].*[\\.\\?!]$")
## [1] TRUE FALSE FALSE TRUE FALSE TRUE
possible_sentences[str_detect(possible_sentences, "^[A-Z].*[\\.\\?!]$")]
## [1] "This might be a sentence." "Am I a sentence?"
## [3] "Regular expressions are useful!"
Now that we’ve gone over some basics of regular expressions, let’s finish our example by splitting the various components of column x into year, site, and plot (sub-site) columns:
site_data
## x temp_c
## 1 2000-copany bay-2 9.247435
## 2 2001-choctawhatchee bay-2 29.170777
## 3 2002-aransas bay-3 62.351057
## 4 2003-choctawhatchee bay-4 89.888624
## 5 2004-choctawhatchee bay-4 96.958163
## 6 2005-choctawhatchee bay-2 49.894849
## 7 2006-choctawhatchee bay-4 53.401312
## 8 2007-galveston bay-2 54.335877
## 9 2008-choctawhatchee bay-1 7.279786
## 10 2009-copany bay-3 8.806454
## 11 2000-choctawhatchee bay-2 21.353557
## 12 2001-copany bay-1 3.229220
## 13 2002-galveston bay-4 71.312880
## 14 2003-choctawhatchee bay-4 42.640502
## 15 2004-copany bay-1 92.070634
## 16 2005-galveston bay-1 95.573717
## 17 2006-choctawhatchee bay-3 99.215221
## 18 2007-copany bay-1 80.570198
## 19 2008-galveston bay-1 98.582018
## 20 2009-galveston bay-2 46.406132
## 21 2000-galveston bay-1 39.235548
## 22 2001-choctawhatchee bay-2 95.831200
## 23 2002-copany bay-1 49.300697
## 24 2003-copany bay-3 29.875656
## 25 2004-copany bay-1 58.682873
## 26 2005-copany bay-1 19.943774
## 27 2006-galveston bay-4 55.101811
## 28 2007-galveston bay-4 77.270760
## 29 2008-copany bay-2 25.395214
## 30 2009-galveston bay-4 51.966997
## 31 2000-choctawhatchee bay-4 83.095610
## 32 2001-copany bay-2 40.698851
## 33 2002-galveston bay-3 24.973809
## 34 2003-choctawhatchee bay-1 44.232596
## 35 2004-galveston bay-3 59.023020
## 36 2005-choctawhatchee bay-4 59.439810
## 37 2006-copany bay-3 63.713113
## 38 2007-copany bay-1 63.446845
## 39 2008-choctawhatchee bay-2 12.215281
## 40 2009-galveston bay-4 24.810948
# I'll show you how to extract the year part
site_data$year <- str_extract(site_data$x, "\\d{4}")
# You do the rest
site_data$site <- str_extract(site_data$x, "") # <- Fill this in between the ""
site_data$plot <- str_extract(site_data$x, "") # <- Fill this in between the ""
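If you get stuck, here is one possible solution; many other patterns would work just as well:

site_data$site <- str_extract(site_data$x, "[a-z ]+") # First run of letters/spaces is the site name
site_data$plot <- str_extract(site_data$x, "\\d$") # The trailing digit is the sub-site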
Other functions worth knowing about for this kind of work:

Base R: grep, gsub, strsplit

The stringr package: str_detect, str_match, str_replace, str_split
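As a quick taste of the base R versions:

fish <- c("grouper", "striper", "sheepshead")
grep("per", fish) # Indices of the elements that match
gsub("e", "E", fish) # Replace every match in each string
strsplit("gag grouper", " ") # Split each string on a pattern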
Data often come to us in strange forms and, before we can even begin analyzing them, we have to do a lot of work to sanitize what we’ve been given. As an example, just the other week I received temporal data with dates formatted like this:
dates <- c("1July17",
"02July2017",
"3July17",
"4July17")
and so on like that. Do you see how the day of the month and the year are represented in different ways through the series? If we want to convert these strings into Date objects for further analysis, we’ll need to do some pre-cleaning before we can do that conversion. Regular expressions work great here.
str_match_all(dates, "(\\d{1,2})([A-Za-z]+)(\\d{2,4})")
## [[1]]
## [,1] [,2] [,3] [,4]
## [1,] "1July17" "1" "July" "17"
##
## [[2]]
## [,1] [,2] [,3] [,4]
## [1,] "02July2017" "02" "July" "2017"
##
## [[3]]
## [,1] [,2] [,3] [,4]
## [1,] "3July17" "3" "July" "17"
##
## [[4]]
## [,1] [,2] [,3] [,4]
## [1,] "4July17" "4" "July" "17"
That regular expression was complex. Let’s break it down into its main parts. Below, I’ve re-formatted the data and the regular expression a bit so we can see what’s going on.
|----------|-----------|----------|
| 1        | July      | 17       |
| 02       | July      | 2017     |
| 3        | July      | 17       |
| 4        | July      | 17       |
|----------|-----------|----------|
| \\d{1,2} | [A-Za-z]+ | \\d{2,4} |
|----------|-----------|----------|
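To finish the conversion, here is a minimal sketch that normalizes the captured pieces and parses them, assuming any two-digit year means 20xx (an assumption worth verifying against the real data):

parts <- str_match(dates, "(\\d{1,2})([A-Za-z]+)(\\d{2,4})")
years <- ifelse(nchar(parts[, 4]) == 2, paste0("20", parts[, 4]), parts[, 4])
as.Date(paste(parts[, 2], parts[, 3], years), format = "%d %B %Y")
# %B parses the full month name in the current locale; in an English locale
# this yields 2017-07-01 through 2017-07-04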
Here’s the code I used to generate the fake site_data data.frame above.
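The original generation code isn’t included here, but a minimal sketch that produces a data.frame of the same shape might look like this (site names and value ranges are read off the data above; the seed and distributions are guesses):

set.seed(42) # Hypothetical seed; the original is unknown
sites <- c("copany bay", "choctawhatchee bay", "aransas bay", "galveston bay")
site_data <- data.frame(
  x = paste(rep(2000:2009, 4),
            sample(sites, 40, replace = TRUE),
            sample(1:4, 40, replace = TRUE),
            sep = "-"),
  temp_c = runif(40, 0, 100),
  stringsAsFactors = FALSE
)
write.csv(site_data, "site_data.csv", row.names = FALSE)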