Introduction

Continuing on from our earlier lesson on SQL, we’re going to learn how to work with SQL and database in R. There are two ways we’re going to go over:

  1. Directly executing SQL statements against a database
  2. Using dplyr’s seamless ability to work with SQL databases

Learning outcomes

Students should:

Lesson

Work with SQL directly using the DBI package

This content is largely adapted from the DBI homepage.

Before we can query a database for information, we have to connect to it. In the chunk below, we connect to a temporary, in-memory database.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

The above command creates an empty database with no tables. We can confirm this is the case with dbListTables():

dbListTables(con)
## character(0)

Before we can do any useful querying, we need to load data into our database. Let’s load a data.frame into a table:

dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)
## [1] "mtcars"

We can immediately inspect what we just loaded like this:

dbListFields(con, "mtcars")
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
dbReadTable(con, "mtcars")
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Running queries is pretty simple:

dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 3  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 4  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 5  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 6  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 7  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 8  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 9  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 10 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 11 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Often times, when working with database, we won’t want to see all of the rows because there are just too many to store in memory. In fact, that’s often the point of using the database in the first place.

The DBI package lets us iterate through the rows in groups of our choosing so that memory usage stays low:

res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")

while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 5)
  print(nrow(chunk))
}
## [1] 5
## [1] 5
## [1] 1
dbClearResult(res)

Notice the output: 5, 5, 1. Since we specified n = 5 in dbFetch, we get new rows in chunks of 5.

After we’re done, our script should close its connection:

dbDisconnect(con)

That’s the approach you’ll use.

Exercise: DBI and starwars

Instructions: Load the starwars dataset from the dplyr package into an in-memory SQLite database and execute a query against the table.

e.g. list just the Droids

library(DBI)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "starwars", select(starwars, -films, -vehicles, -starships))
             
# Your code here

Automatic SQL with dplyr

dplyr is already an amazing tool for analyzing data. But when we use it we’re usually using it against a data.frame (or something similar). Because it’s awesome, dplyr also supports working directly with database tables as if they were regular old data.frames.

All we have to do is use the tbl instead of, for example, our read.csv function call:

# Set up an SQLite database again, as before
library(dplyr)
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(DBI)

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Now the real magic
mtcars_db <- tbl(con, "mtcars")
mtcars_db
## # Source:   table<mtcars> [?? x 11]
## # Database: sqlite 3.19.3 [:memory:]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
##  2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
##  3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
##  4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
##  5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
##  6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
##  7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
##  8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
##  9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with more rows
mtcars_db %>% 
  select(mpg) %>% 
  show_query()
## <SQL>
## SELECT `mpg` AS `mpg`
## FROM `mtcars`
mpg_query <- mtcars_db %>% 
  group_by(cyl) %>% 
  summarize(mean(mpg))

mpg_query %>% 
  show_query()
## <SQL>
## SELECT `cyl`, AVG(`mpg`) AS `mean(mpg)`
## FROM `mtcars`
## GROUP BY `cyl`
# To get the data out of our query, we use collect()
class(mpg_query)
## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl"
mpg_query %>% 
  collect() %>% 
  class()
## [1] "tbl_df"     "tbl"        "data.frame"

There, that’s it. dbplyr automatically converts our dplyr code into SQL queries behind the scenes and we get to use the usual dplyr functionality. Awesome!

***Exercise: Connect to the SQLite database at /home/mecum/oss/ramlegacy.db on Aurora and calculate:

  • Mean catch by stock ID
  • Start and end year of each assessment
library(dplyr)
library(dbplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "/home/mecum/oss/ramlegacy.db")
dbListTables(con)

# Mean catch per stock
# Your code here

# Time range of assessment per stock
# Your code here

Accessing an Microsoft Access database in R

Working with Microsoft Access databases in R is notoriously tricky. As far as I can tell, the only way to make it work is to be running 32-bit Windows. Most people are not running 32-bit Windows.

If you are, the basic incantation is:

install.packages('RODBC')
library(RODBC)
db <- odbcConnectAccess("yourAccessDB.db")

And then you can send queries with odbcQuery()

Summary

Solutions

library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "starwars", select(starwars, -films, -vehicles, -starships))

dbGetQuery(con, "SELECT * FROM starwars WHERE species = 'Droid'")

dbDisconnect(con)
library(dplyr)
library(dbplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "/home/mecum/oss/ramlegacy.db")
dbListTables(con)

assessments <- tbl(con, "assessments")
timeseries <- tbl(con, "timeseries")

# Mean catch per stock
left_join(timeseries, assessments, "stockid") %>% 
  group_by(stockid) %>% 
  summarise(meanvalue = mean(tsvalue))

# Time range of assessment per stock
left_join(timeseries, assessments, "stockid") %>% 
  group_by(stockid) %>% 
  summarise(year_start = min(tsyear),
            year_end = max(tsyear))