Manipulating data at scale is challenging at best, particularly with tools like R that do most data management in memory. Relational databases are mature systems that easily handle storing, querying, and manipulating large, complex data sets.
A relational database management system (RDBMS) is a software system design to manage relations, or tabular, data. Several popular relational databases exist, including the following:
In this lesson, we will use postgres to learn to create, query, and manipulate relational database tables.
Creating a database in postgres is straightforward given the proper permissions using the createuser
and createdb
utilities that ship with postgres.
To create a user account, first log in as the postgres
user, and then create an account in postgres that matches the name of your shell account:
$ createuser jones -N -P -s -d
The -s
gives the account supeuser priviledges, while -d
gives the permission to create new databases. See createdb --help
for an explanation of all of the options.
Next, exit the postgres account and, from your local user account, cd into the lessons directory and then create the database:
$ cd ~/oss-lessons/sql-database
$ createdb oss
$ psql oss
psql (9.3.17)
Type "help" for help.
oss=#
This should open the psql
commandline utility for interacting with the database. The prompt should look something like oss=#
. From that prompt, you can then run both SQL and psql utilities to view and manipulate the database. For example:
oss=# \?
oss=# \pset pager
Pager usage is off.
oss=# \d
No relations found.
In the following block, we use the CREATE TABLE
command to create a table called sites
with three columns, a siteid
, altitude
, and habitat
. We also specify that the column siteid
will act as the primary key constraint on the table, which will automatically set up a UNIQUE
index on the siteid
column. Note how each of the columns has a data type which constrains the possible values for that column.
CREATE TABLE sites (
siteid INT8,
altitude INT8,
habitat VARCHAR(25),
CONSTRAINT site_pk PRIMARY KEY (siteid)
);
And the output will indicate that the command was run successfully with:
CREATE TABLE
Now, in psql
you can inspect the table that was created to see who owns it, its structure, and its contents:
oss=# \d
List of relations
Schema | Name | Type | Owner
--------+-------+-------+-------
public | sites | table | jones
(1 row)
oss=# \d sites
Table "public.sites"
Column | Type | Modifiers
----------+-----------------------+-----------
siteid | bigint | not null
altitude | bigint |
habitat | character varying(25) |
Indexes:
"site_pk" PRIMARY KEY, btree (siteid)
oss=# SELECT * FROM sites;
siteid | altitude | habitat
--------+----------+---------
(0 rows)
Not so interesting without any data in the table. We can insert a single row of data with the INSERT
sql command. Let’s insert a bunch (note that from here we are not showing the oss=#
prompt):
INSERT INTO sites (siteid, altitude, habitat) VALUES (1, 722, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (2, 805, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (3, 887, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (4, 920, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (5, 110, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (6, 192, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (7, 121, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (8, 108, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (9, 722, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (10, 805, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (11, 887, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (12, 920, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (13, 110, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (14, 192, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (15, 121, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (16, 108, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (17, 722, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (18, 805, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (19, 887, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (20, 920, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (21, 110, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (22, 192, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (23, 121, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (24, 108, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (25, 722, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (26, 805, 'riparian');
INSERT INTO sites (siteid, altitude, habitat) VALUES (27, 887, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (28, 920, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (29, 110, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (30, 192, 'mixed');
INSERT INTO sites (siteid, altitude, habitat) VALUES (31, 121, 'forest');
INSERT INTO sites (siteid, altitude, habitat) VALUES (32, 108, 'riparian');
And now we can run various queries to explore the data. The *
keyword is used to select all of the columns from the table.
-- SIMPLE SELECT
SELECT * from sites;
siteid | altitude | habitat
--------+----------+----------
1 | 722 | riparian
2 | 805 | riparian
3 | 887 | mixed
4 | 920 | mixed
5 | 110 | forest
6 | 192 | mixed
7 | 121 | forest
8 | 108 | riparian
9 | 722 | riparian
10 | 805 | riparian
11 | 887 | mixed
12 | 920 | mixed
13 | 110 | forest
14 | 192 | mixed
15 | 121 | forest
16 | 108 | riparian
17 | 722 | riparian
18 | 805 | riparian
19 | 887 | mixed
20 | 920 | mixed
21 | 110 | forest
22 | 192 | mixed
23 | 121 | forest
24 | 108 | riparian
25 | 722 | riparian
26 | 805 | riparian
27 | 887 | mixed
28 | 920 | mixed
29 | 110 | forest
30 | 192 | mixed
31 | 121 | forest
32 | 108 | riparian
(32 rows)
One can also select just a single column from the table such as habitat
, order the results using the ORDER BY
clause, and remove all of the duplicate results using the DISTINCT
clause:
SELECT habitat from sites;
SELECT habitat from sites order by habitat;
SELECT DISTINCT habitat from sites order by habitat;
habitat
----------
forest
mixed
riparian
(3 rows)
The power of SQL truly comes to the fore with the WHERE
clause, which allows you to choose which rows of the tables will be returned. A simple WHERE
clause will provide a comparator against one of the table variables, such as finding all sites where the altitude is greater than 500, or where the habitat is riparian
:
-- SELECT with a constraint
SELECT * from sites where altitude > 500;
siteid | altitude | habitat
--------+----------+----------
1 | 722 | riparian
2 | 805 | riparian
3 | 887 | mixed
4 | 920 | mixed
9 | 722 | riparian
10 | 805 | riparian
11 | 887 | mixed
12 | 920 | mixed
17 | 722 | riparian
18 | 805 | riparian
19 | 887 | mixed
20 | 920 | mixed
25 | 722 | riparian
26 | 805 | riparian
27 | 887 | mixed
28 | 920 | mixed
(16 rows)
SELECT * from sites where habitat = 'riparian';
siteid | altitude | habitat
--------+----------+----------
1 | 722 | riparian
2 | 805 | riparian
8 | 108 | riparian
9 | 722 | riparian
10 | 805 | riparian
16 | 108 | riparian
17 | 722 | riparian
18 | 805 | riparian
24 | 108 | riparian
25 | 722 | riparian
26 | 805 | riparian
32 | 108 | riparian
(12 rows)
The constraints can be logically combined to create poweful filters.
-- SELECT with a more complex constraint
SELECT * from sites where habitat = 'riparian' AND altitude > 500;
siteid | altitude | habitat
--------+----------+----------
1 | 722 | riparian
2 | 805 | riparian
9 | 722 | riparian
10 | 805 | riparian
17 | 722 | riparian
18 | 805 | riparian
25 | 722 | riparian
26 | 805 | riparian
(8 rows)
In addition, queries can aggregate values across the records that share some characteristic using the GROUP BY
clause. This is useful to calculate a count, sum, average, or maximum of the records in groups. For example, to count how many records are represented in each habitat
, use:
-- SELECT with an aggregation function using group by
SELECT habitat, count(*) from sites group by habitat order by habitat;
habitat | count
----------+-------
forest | 8
mixed | 12
riparian | 12
(3 rows)
Integrity constraints ensure that primary keys are unique, and that foreign keys in a child table are matched to a corresponding record in the parent table. In this example, because the siteid
with value 1
is already in the table, the INSERT
operation will fail.
-- INTEGRITY CONSTRAINTS: PRIMARY KEY ENFORCEMENT
INSERT INTO sites (siteid, altitude, habitat) VALUES (1, 721, 'scrub');
ERROR: duplicate key value violates unique constraint "site_pk"
DETAIL: Key (siteid)=(1) already exists.
Updating and deleting records occurs by selecting the records to be changed using a WHERE
clause within the containing UPDATE
or DELETE
.
-- UPDATE
UPDATE sites SET altitude=721 where siteid=1;
UPDATE 1
-- DELETE
INSERT INTO sites (siteid, altitude, habitat) VALUES (33, 121, 'forest');
SELECT count(*) from sites;
count
-------
33
(1 row)
DELETE from sites where siteid=33;
SELECT count(*) from sites;
count
-------
32
(1 row)
Both the UPDATE
and DELETE
commands need to be used with caution, as they can easily delete or change all of your data if they are run with a poorly constructed WHERE
clause. One way to protect yourself is to nest your commands in a TRANSACTION
, which holds the changes in a temporary store until you have verified them and instruct the system to COMMIT
the changes.
-- TRANSACTIONS
INSERT INTO sites (siteid, altitude, habitat) VALUES (33, 121, 'forest');
START TRANSACTION;
DELETE FROM sites WHERE siteid > 7;
DELETE 26
SELECT * from sites order by siteid;
siteid | altitude | habitat
--------+----------+----------
1 | 721 | riparian
2 | 805 | riparian
3 | 887 | mixed
4 | 920 | mixed
5 | 110 | forest
6 | 192 | mixed
7 | 121 | forest
(7 rows)
-- Oh Crap!
ROLLBACK;
SELECT * from sites WHERE siteid > 32 order by siteid;
siteid | altitude | habitat
--------+----------+---------
33 | 121 | forest
(1 row)
-- Try again to delete the right set of rows
START TRANSACTION;
DELETE FROM sites WHERE siteid > 32;
SELECT * from sites WHERE siteid > 32 order by siteid;
COMMIT;
Inserting data one row at a time can be tedious. The SQL COPY
command can be used to batch load large numbers of records from various file formats, including delimited files such as CSV files. Let’s create a new table called plotobs
, and load the data for that table from a CSV file.
-- CREATE ANOTHER TABLE THAT WILL BE RELATED TO site
CREATE TABLE plotobs (
obsid INT8,
siteid INT8,
plot CHAR(10),
date_sampled DATE,
sciname VARCHAR(100),
diameter NUMERIC,
condition VARCHAR(10),
CONSTRAINT plotobs_pk PRIMARY KEY (obsid),
CONSTRAINT plotobs_site_fk FOREIGN KEY (siteid) REFERENCES sites
);
-- Batch load data from a CSV file
COPY plotobs FROM '/home/jones/oss-lessons/sql-database/plotobs.csv' DELIMITER ',' CSV HEADER;
COPY 3296
And now we can show another aggregation query to calculate the average diameter by the scientific name of the organism.
-- ANOTHER AGGREGATION QUERY
SELECT sciname, avg(diameter) FROM plotobs GROUP BY sciname ORDER BY sciname;
sciname | avg
-----------------------+---------------------
Abies lasiocarpa | 19.4460416666666667
Jamesia americana | 5.5339062500000000
Picea engelmannii | 18.3617659883720930
Pseudotsuga menziesii | 18.3057226562500000
Sambucus racemosa | 10.4759375000000000
(5 rows)
Combining tables in order to use the data in one table in combination with data from another allows powerful aggregations. Most of the time, users will want to use an INNER JOIN in which two tables are joined using a common key, and in which the rows in the first table are matched to the rows in the second table with the same key value. One can also use the more rare but at times useful LEFT JOIN and the FULL OUTER JOIN.
In the next example, we show an INNER JOIN that is used to integrate the sites
table with the plotobs
table by joining on the siteid
key that is shared between the tables. The query returns one row for each siteid
that is on both tables, and associates the columns that are selected. In this example, the resulting table could be used to graph tree diameter
as a function of altitude
, possibly grouping by sciname
.
-- SIMPLE INNER JOIN
SELECT s.siteid, s.altitude, p.obsid, p.plot, p.sciname, p.diameter FROM sites s, plotobs p WHERE p.siteid = s.siteid;
siteid | altitude | obsid | plot | sciname | diameter
--------+----------+-------+------------+-----------------------+----------
1 | 721 | 1 | A | Abies lasiocarpa | 31.84
1 | 721 | 2 | A | Picea engelmannii | 3.21
1 | 721 | 3 | A | Picea engelmannii | 7.2
1 | 721 | 4 | A | Picea engelmannii | 11.62
1 | 721 | 5 | A | Picea engelmannii | 11.25
1 | 721 | 6 | A | Picea engelmannii | 13.16
1 | 721 | 7 | A | Picea engelmannii | 18.6
1 | 721 | 8 | A | Picea engelmannii | 23.62
1 | 721 | 9 | A | Picea engelmannii | 31.75
1 | 721 | 10 | A | Picea engelmannii | 33.27
1 | 721 | 11 | A | Picea engelmannii | 36.52
1 | 721 | 12 | A | Picea engelmannii | 60.59
1 | 721 | 13 | A | Sambucus racemosa | 3.83
1 | 721 | 14 | A | Pseudotsuga menziesii | 0.75
1 | 721 | 15 | A | Pseudotsuga menziesii | 17.51
1 | 721 | 16 | A | Pseudotsuga menziesii | 20.56
1 | 721 | 17 | A | Pseudotsuga menziesii | 49.51
1 | 721 | 18 | A | Jamesia americana | 1.8
...
32 | 108 | 3284 | B | Sambucus racemosa | 18.06
32 | 108 | 3285 | B | Pseudotsuga menziesii | 47.18
32 | 108 | 3286 | B | Pseudotsuga menziesii | 1.59
32 | 108 | 3287 | B | Pseudotsuga menziesii | 4.38
32 | 108 | 3288 | B | Pseudotsuga menziesii | 3.09
32 | 108 | 3289 | B | Jamesia americana | 7.98
32 | 108 | 3290 | B | Abies lasiocarpa | 10.85
32 | 108 | 3291 | B | Abies lasiocarpa | 13.55
32 | 108 | 3292 | B | Abies lasiocarpa | 17.26
32 | 108 | 3293 | B | Abies lasiocarpa | 21.65
32 | 108 | 3294 | B | Abies lasiocarpa | 17.8
32 | 108 | 3295 | B | Abies lasiocarpa | 23.4
32 | 108 | 3296 | B | Abies lasiocarpa | 25.79
Finally, a UNION
is used to concatenate the results of two or more queries, while INTERSECT
is used to subtract the results of a second query from the results of a first query.
-- UNION for concatenating results of two queries
SELECT * FROM plotobs WHERE diameter > 30
UNION
SELECT * FROM plotobs WHERE diameter < 10;
-- INTERSECT for finding common results between two queries
SELECT * FROM plotobs WHERE diameter > 34
INTERSECT
SELECT * FROM plotobs WHERE diameter < 35;
obsid | siteid | plot | date_sampled | sciname | diameter | condition
-------+--------+------------+--------------+-------------------+----------+-----------
2647 | 26 | B | 2011-06-13 | Picea engelmannii | 34.27 | dry
1967 | 20 | A | 2011-06-13 | Picea engelmannii | 34.16 | dry
1680 | 17 | A | 2012-06-10 | Abies lasiocarpa | 34.59 | normal
1205 | 12 | B | 2011-06-13 | Picea engelmannii | 34.33 | dry
1814 | 18 | B | 2011-06-13 | Abies lasiocarpa | 34.83 | normal
484 | 5 | B | 2011-06-13 | Picea engelmannii | 34.4 | dry
2070 | 21 | A | 2011-06-13 | Picea engelmannii | 34.56 | dry
1886 | 19 | A | 2012-06-10 | Abies lasiocarpa | 34.2 | normal
1720 | 17 | B | 2011-06-13 | Picea engelmannii | 34.16 | dry
341 | 4 | A | 2012-06-10 | Abies lasiocarpa | 34.5 | normal
825 | 9 | A | 2011-06-13 | Abies lasiocarpa | 34.19 | normal
2482 | 25 | A | 2011-06-13 | Picea engelmannii | 34.48 | dry
3050 | 30 | B | 2011-06-13 | Abies lasiocarpa | 34.93 | normal
2710 | 27 | A | 2012-06-10 | Abies lasiocarpa | 34.62 | normal
1761 | 18 | A | 2011-06-13 | Picea engelmannii | 34.9 | dry
2164 | 22 | A | 2011-06-13 | Abies lasiocarpa | 34.61 | normal
2235 | 22 | B | 2011-06-13 | Picea engelmannii | 34.07 | dry
3162 | 31 | B | 2011-06-13 | Picea engelmannii | 34.66 | dry
1608 | 16 | B | 2011-06-13 | Abies lasiocarpa | 34.29 | normal
238 | 3 | A | 2012-06-10 | Abies lasiocarpa | 34.83 | normal
175 | 2 | B | 2011-06-13 | Picea engelmannii | 34.02 | dry
1917 | 19 | B | 2011-06-13 | Abies lasiocarpa | 34.83 | normal
72 | 1 | B | 2011-06-13 | Picea engelmannii | 34.9 | dry
207 | 3 | A | 2011-06-13 | Abies lasiocarpa | 34.39 | normal
1474 | 15 | A | 2012-06-10 | Abies lasiocarpa | 34.74 | normal
216 | 3 | A | 2011-06-13 | Picea engelmannii | 34.71 | dry
2132 | 21 | B | 2011-06-13 | Picea engelmannii | 34.17 | dry
3225 | 32 | A | 2012-06-10 | Abies lasiocarpa | 34.22 | normal
1505 | 15 | B | 2011-06-13 | Abies lasiocarpa | 34.63 | normal
784 | 8 | B | 2011-06-13 | Abies lasiocarpa | 34.56 | normal
2473 | 25 | A | 2011-06-13 | Abies lasiocarpa | 34.33 | normal
(31 rows)