Continuing on from our earlier lesson on SQL, we're going to learn how to work with SQL and databases in R. There are two approaches we're going to go over:

- The DBI package
- dplyr's seamless ability to work with SQL databases

Students should:

- Be able to connect dplyr to a database
- Be familiar with the DBI package

This content is largely adapted from the DBI homepage.
Before we can query a database for information, we have to connect to it. In the chunk below, we connect to a temporary, in-memory database.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
The above command creates an empty database with no tables. We can confirm this is the case with dbListTables():
dbListTables(con)
## character(0)
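If we wanted our data to persist between R sessions, we could point dbConnect() at a file on disk instead of ":memory:". A minimal sketch, using a hypothetical file name:
# Connect to (and create, if it doesn't already exist) an on-disk SQLite file.
# "my-db.sqlite" is just a placeholder name for illustration.
con_disk <- dbConnect(RSQLite::SQLite(), dbname = "my-db.sqlite")
dbListTables(con_disk)  # empty until we write a table into it
dbDisconnect(con_disk)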
Before we can do any useful querying, we need to load data into our database. Let's load a data.frame into a table:
dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)
## [1] "mtcars"
We can immediately inspect what we just loaded like this:
dbListFields(con, "mtcars")
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
dbReadTable(con, "mtcars")
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Running queries is pretty simple:
dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
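If the value we're filtering on lives in an R variable, we can pass it as a query parameter rather than pasting it into the SQL string. A small sketch using DBI's params argument with RSQLite's ? placeholder:
# Same query as above, but with the cylinder count supplied as a parameter
n_cyl <- 4
dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = ?", params = list(n_cyl))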
Oftentimes, when working with a database, we won't want to see all of the rows because there are just too many to store in memory. In fact, that's often the point of using a database in the first place. The DBI package lets us iterate through the rows in groups of our choosing so that memory usage stays low:
res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 5)
  print(nrow(chunk))
}
## [1] 5
## [1] 5
## [1] 1
dbClearResult(res)
Notice the output: 5, 5, 1. Since we specified n = 5 in dbFetch(), we get new rows in chunks of 5.
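The same pattern works when we want to actually do something with each chunk rather than just count its rows. A sketch that sums mpg for the 4-cylinder cars while holding at most 5 rows in memory at a time:
res <- dbSendQuery(con, "SELECT mpg FROM mtcars WHERE cyl = 4")
total_mpg <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 5)          # fetch the next 5 rows (or fewer)
  total_mpg <- total_mpg + sum(chunk$mpg)
}
dbClearResult(res)
total_mpg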
After we’re done, our script should close its connection:
dbDisconnect(con)
That’s the approach you’ll use.
Exercise: DBI and starwars
Instructions: Load the starwars dataset from the dplyr package into an in-memory SQLite database and execute a query against the table, e.g., list just the Droids.
library(DBI)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "starwars", select(starwars, -films, -vehicles, -starships))
# Your code here
dplyr is already an amazing tool for analyzing data. But when we use it, we're usually using it against a data.frame (or something similar). Because it's awesome, dplyr also supports working directly with database tables as if they were regular old data.frames.
All we have to do is use tbl() instead of, for example, our read.csv() call:
# Set up an SQLite database again, as before
library(dplyr)
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "mtcars", mtcars)
# Now the real magic
mtcars_db <- tbl(con, "mtcars")
mtcars_db
## # Source: table<mtcars> [?? x 11]
## # Database: sqlite 3.19.3 [:memory:]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## # ... with more rows
mtcars_db %>%
  select(mpg) %>%
  show_query()
## <SQL>
## SELECT `mpg` AS `mpg`
## FROM `mtcars`
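Other dplyr verbs get translated too; for example, filter() becomes a WHERE clause. A quick sketch (the exact SQL text dbplyr prints can vary a little between versions):
mtcars_db %>%
  filter(cyl == 4) %>%
  show_query()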
mpg_query <- mtcars_db %>%
  group_by(cyl) %>%
  summarize(mean(mpg))
mpg_query %>%
  show_query()
## <SQL>
## SELECT `cyl`, AVG(`mpg`) AS `mean(mpg)`
## FROM `mtcars`
## GROUP BY `cyl`
# To get the data out of our query, we use collect()
class(mpg_query)
## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
mpg_query %>%
  collect() %>%
  class()
## [1] "tbl_df" "tbl" "data.frame"
There, that’s it. dbplyr
automatically converts our dplyr
code into SQL queries behind the scenes and we get to use the usual dplyr
functionality. Awesome!
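Just as with plain DBI, we should close the connection once we're finished with it:
dbDisconnect(con)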
Exercise: Connect to the SQLite database at /home/mecum/oss/ramlegacy.db on Aurora and calculate:
library(dplyr)
library(dbplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "/home/mecum/oss/ramlegacy.db")
dbListTables(con)
# Mean catch per stock
# Your code here
# Time range of assessment per stock
# Your code here
Working with Microsoft Access databases in R is notoriously tricky. As far as I can tell, the only way to make it work is to be running 32-bit Windows. Most people are not running 32-bit Windows.
If you are, the basic incantation is:
install.packages('RODBC')
library(RODBC)
db <- odbcConnectAccess("yourAccessDB.db")
And then you can send queries with odbcQuery().
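A minimal sketch, assuming a table called yourTable exists in the Access file (here using RODBC's higher-level sqlQuery() helper, which returns the result as a data.frame):
# "yourTable" is a placeholder table name for illustration
result <- sqlQuery(db, "SELECT * FROM yourTable")
head(result)
odbcClose(db)  # close the connection when finished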
To summarize: use the DBI package to connect to a database and run queries against it directly, or use dplyr against an SQL database for intuitive querying that is fast.
Answers
library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "starwars", select(starwars, -films, -vehicles, -starships))
dbGetQuery(con, "SELECT * FROM starwars WHERE species = 'Droid'")
dbDisconnect(con)
library(dplyr)
library(dbplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "/home/mecum/oss/ramlegacy.db")
dbListTables(con)
assessments <- tbl(con, "assessments")
timeseries <- tbl(con, "timeseries")
# Mean catch per stock
left_join(timeseries, assessments, "stockid") %>%
  group_by(stockid) %>%
  summarise(meanvalue = mean(tsvalue))
# Time range of assessment per stock
left_join(timeseries, assessments, "stockid") %>%
  group_by(stockid) %>%
  summarise(year_start = min(tsyear),
            year_end = max(tsyear))