Join

Module Learning Objectives

By the end of this module, you will be able to:

  • Differentiate dplyr’s various join functions from each other
  • Use dplyr’s left_join, right_join, inner_join, full_join, and anti_join functions to manipulate two dataframes

Combining data

Now that we know how to manipulate a single dataframe, how do we manipulate multiple dataframes? If we have multiple sources of data and we want to combine them together into one dataframe or table, we can join them through any shared column(s)! Data you’ll be joining can be called “relational data”, because there is some kind of relationship between the dataframes that you’ll be leveraging. In the tidyverse, combining data that has a relationship is called “joining”. Let’s look at some of dplyr’s many join functions!

In each of the following join functions, you provide two dataframes, the one you arbitrarily provide first is called the “left” dataframe while the other is called the “right” dataframe. This is important because each of the different join functions brings the columns from one of the dataframes into the other depending on (1) which dataframe is left and which is right and (2) what type of join you specify.

This becomes somewhat more intuitive when looking at tangible examples so let’s prepare some data to join in different ways!

join Data Preparation

For demonstration purposes, let’s add a new column called record_number to our penguins data and call the new dataframe penguins_tidy. As you can see below, each row is now numbered from 1 to the length of the dataframe.

# Add a column called `record_number` to our penguins dataset
penguins_tidy <- penguins %>%
    dplyr::mutate(record_number = 1:n(), .before = dplyr::everything())

dplyr::glimpse(penguins_tidy)
Rows: 344
Columns: 9
$ record_number     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

The palmerpenguins package also has a “penguins_raw” dataset with additional, raw information on the same penguins, such as their sampling region, unique identifier, and the date when their nest was observed. Again, for demonstration purposes, let’s add a new column called record_number and call this new dataframe penguins_extra.

# Add a column called `record_number` to our raw penguins dataset
penguins_extra <- penguins_raw %>%
  dplyr::mutate(record_number = 1:n()) %>%
  # Also keep only desired columns to avoid unnecessary complexity
  dplyr::select(record_number, Region, `Individual ID`, `Date Egg`)

dplyr::glimpse(penguins_extra)
Rows: 344
Columns: 4
$ record_number   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ Region          <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "Anv…
$ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A1"…
$ `Date Egg`      <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 2007-…

Now that we have two dataframes that both have a column called record_number, we can join them together to combine information in various ways!

Also, note that if column names include spaces (as in Individual ID and Date Egg) they need to have a “backtick” (`) on either side. On your keyboard, a backtick (`) is on the left just below the “escape” key, and shares a button with the tilde (~).

left_join Example: Prioritize the “Left” Dataframe

Example

In a left_join, we bring the columns from the right dataframe that match rows found in the specified column(s) of the left dataframe.

We can specify the column that we want to join based on with by = .... If we don’t provide this argument, then dplyr will automatically join on all matching columns between the left and right dataframes. In our case, we want to left_join by record_number.

To better demonstrate that only rows found in the left dataframe will be joined from the right dataframe, we’ll use the pipe %>% to filter the left dataframe before joining.

# Left-join the two dataframes together on the shared column!
penguins_left_joined <- penguins_tidy %>%
  dplyr::filter(record_number < 5) %>%
  dplyr::left_join(y = penguins_extra, by = "record_number")

dplyr::glimpse(penguins_left_joined)
Rows: 4
Columns: 12
$ record_number     <int> 1, 2, 3, 4
$ species           <fct> Adelie, Adelie, Adelie, Adelie
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA
$ flipper_length_mm <int> 181, 186, 195, NA
$ body_mass_g       <int> 3750, 3800, 3250, NA
$ sex               <fct> male, female, female, NA
$ year              <int> 2007, 2007, 2007, 2007
$ Region            <chr> "Anvers", "Anvers", "Anvers", "Anvers"
$ `Individual ID`   <chr> "N1A1", "N1A2", "N2A1", "N2A2"
$ `Date Egg`        <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16

What we have in the end is penguins_left_joined, a dataframe with information from both penguins_tidy and penguins_extra! All of the rows in our filtered penguins_tidy are kept but only the rows from penguins_extra that have a matching record_number in penguins_tidy are included.

right_join Example: Prioritize the “Right” Dataframe

Example

In a right_join, we bring rows from the left dataframe into the right dataframe based on the values in the specified column(s) of the right dataframe.

As the names imply, a right_join is the opposite of a left_join.

inner_join Example: Keep Rows Found in Both Dataframes

Example

In an inner_join, we keep only the rows where the values in the column we are joining by are found in both dataframes.

This can be really useful when one of the dataframes includes supplementary data that has incomplete coverage on the other dataframe and you want to simultaneously combine the dataframes and remove the inevitable NAs that will be created.

For example, imagine that you have a dataframe of 100 study sites with information on plant growth and a second dataframe of soil chemistry information. Your grant budget was really tight though so you needed to prioritize sample processing and you only have soil chemistry for 20 of the sites where you have plant growth data.

If you use inner_join on your plant growth and soil chemistry datasets, you will create a single dataframe with both chemistry and plant data that only has the sites (i.e., rows) where you had data for both. This dataframe then would likely be ready for analysis because you’d have complete data for every site in the new joined dataframe!

Note that in an inner_join it doesn’t matter which dataframe is “left” and which is “right” because either way you’re only keeping the rows that are found in both dataframes.

full_join Example: Combine All Data in Both Dataframes

Example

In a full_join, we keep all values and all rows.

A full_join is “smart” enough to fill with NAs in all rows that don’t match between the two dataframes. Also, just like an inner_join, a full_join doesn’t care about which dataframe is “left” and which is “right” because all columns are getting combined regardless of which is left vs. right.

anti_join Example: Keep Only Columns that Aren’t Shared

Example

In an anti_join, we return rows of the left dataframe that do not have a match in the right dataframe. This can be used to see what will not be included in a join.

One case where an anti_join is particularly useful is that of “text mining” where you have one dataframe with a column of individual words that you’ve split apart from a larger block of free text. If you also have a dataframe of one column that contains words that you want to remove from your “actual” data (e.g., “and”, “not”, “I”, “me”, etc.), you can anti_join the two dataframes to quickly remove all of those unwanted words from your text mining dataframe.

Additional Notes

  • If we want to join by more than one matching column, we can specify multiple columns with a vector like so: by = c("column1", "column2").

  • We can also use a named vector, by = c("column_a" = "COLUMN_A") to match on columns that have different names in each dataframe.