Summarize

Module Learning Objectives

By the end of this module, you will be able to:

  • Describe the purpose of the pipe operator (%>%)
  • Use the pipe operator (%>%) to chain multiple functions together
  • Summarize data by using dplyr’s group_by and summarize functions

Pipe Operator (%>%)

Before diving into the tidyverse functions that allow for summarization and group-wise operations, let’s talk about the pipe operator (%>%). The pipe is from the magrittr package and allows chaining together multiple functions without needing to create separate objects at each step as you would have to without the pipe.

%>% Example: Using the Pipe

Example

As in the other chapters, let’s use the “penguins” data object found in the palmerpenguins package. Let’s say we want to keep only specimens that have a measurement for both bill length and bill depth and then remove the flipper and body mass columns.

Without the pipe–but still using other tidyverse functions–we could go about this like this:

# Filter out the NAs
penguins_v2 <- dplyr::filter(.data = penguins,
                              !is.na(bill_length_mm) & !is.na(bill_depth_mm))

# Now strip away the columns we don't want
penguins_v3 <- dplyr::select(.data = penguins_v2, 
                             -flipper_length_mm, -body_mass_g)

# And we can look at our final product with `base::head`
dplyr::glimpse(penguins_v3)
Rows: 342
Columns: 6
$ species        <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
$ island         <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, …
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 3…
$ bill_depth_mm  <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 1…
$ sex            <fct> male, female, female, female, male, female, male, NA, N…
$ year           <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…

Using the pipe though we can simplify this code dramatically! Note that each of the following lines must end with the %>% so that R knows there are more lines to consider.

# We begin with the name of the data object
penguins %>%
  # Then we can filter the data
  dplyr::filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) %>%
  # And strip away the columns we don't want
  dplyr::select(-flipper_length_mm, -body_mass_g) %>%
  # And we can even include the `glimpse` function to see our progress
  dplyr::glimpse()
Rows: 342
Columns: 6
$ species        <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie,…
$ island         <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, …
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 3…
$ bill_depth_mm  <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 1…
$ sex            <fct> male, female, female, female, male, female, male, NA, N…
$ year           <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…

Note that using the pipe allows each line to inherit the data created by the previous line.

Challenge: %>%

Your Turn!

Using pipes, filter the data to only include male penguins, select only the columns for species, island, and body mass, and filter out any rows with NA for body mass.

Aside: Fun History of Why %>% is a “Pipe”

The Belgian painter René Magritte famously created a painting titled “The Treachery of Images” featuring a depiction of a smoking pipe above the words “Cest ci n’est pas une pipe” (French for “This is not a pipe”). Magritte’s point was about how the depiction of a thing is not equal to thing itself. The magrittr package takes its name from the painter because it also includes a pipe that functions slightly differently from a command line pipe and uses different characters. Just like Magritte’s pipe, %>% both is and isn’t a pipe!

Group-Wise Summarizing

Now that we’ve covered the %>% operator we can use it to do group-wise summarization! Technically this summarization does not require the pipe but it does inherently have two steps and thus benefits from using the pipe to chain together those technically separate instructions.

To summarize by groups we first define our groups using dplyr’s group_by function and then summarize using summarize (also from dplyr). summarize does require you to specify what calculations you want to perform within your groups though it uses similar syntax to dplyr’s mutate function.

Despite the similarity in syntax between summarize and mutate there are a few crucial differences:

  • summarize returns only a single row per group while mutate returns as many rows as are in the original dataframe
  • summarize will automatically remove any columns that aren’t either (1) included in group_by or (2) created by summarize. mutate cannot remove columns so it only creates whatever you tell it to.

group_by + summarize Example: Summarize within Groups

Example

By using the %>% with group_by and summarize, we can calculate some summarized metric within our specified groups. To begin, let’s find the average bill depth within each species of penguin.

# Begin with the data and a pipe
penguins %>%
  # Group by the desired column names
  dplyr::group_by(species) %>%
  # And summarize in the way we desire
  dplyr::summarize(mean_bill_dep_mm = mean(bill_depth_mm, na.rm = TRUE) )
# A tibble: 3 × 2
  species   mean_bill_dep_mm
  <fct>                <dbl>
1 Adelie                18.3
2 Chinstrap             18.4
3 Gentoo                15.0

Notice how the resulting dataframe only contains one row per value in the group_by call and only includes the grouping column and the column we created (mean_bill_dep_mm)? This reduction in dimensions is an inherent property of summarize and can be intensely valuable but be careful you don’t accidentally remove columns that you want!

group_by + summarize Example: Calculate Multiple Metrics

Example

Let’s say we want to find multiple summary values for body mass of each species of penguin on each island. To accomplish this we can do the following:

# Begin with the data and a pipe
penguins %>%
  # Group by the desired column names
  dplyr::group_by(species, island) %>%
  # And summarize in the way we desire
  dplyr::summarize(
    # Get average body mass
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),
    # Get the standard deviation
    sd_mass = sd(body_mass_g, na.rm = TRUE),
    # Count the number of individual penguins of each species at each island
    n_mass = dplyr::n(),
    # Calculate standard error from SD divided by count
    se_mass = sd_mass / sqrt(n_mass) )
# A tibble: 5 × 6
# Groups:   species [3]
  species   island    mean_mass_g sd_mass n_mass se_mass
  <fct>     <fct>           <dbl>   <dbl>  <int>   <dbl>
1 Adelie    Biscoe          3710.    488.     44    73.5
2 Adelie    Dream           3688.    455.     56    60.8
3 Adelie    Torgersen       3706.    445.     52    61.7
4 Chinstrap Dream           3733.    384.     68    46.6
5 Gentoo    Biscoe          5076.    504.    124    45.3

You can see that we also invoked the n function from dplyr to return the size of each group. This function reads any groups created by group_by and returns the count of rows in the dataframe for each group level.

Just like mutate, summarize will allow you to create as many columns as you want. So, if you want metrics calculated within your groups, you only need to define each of them within the summarize function.

Challenge: summarize

Your Turn!

Using what we’ve covered so far, find the average flipper length in each year (regardless of any other grouping variable).

Grouping Cautionary Note

group_by can be extremely useful in summarizing a dataframe or creating a new column without losing rows but you need to be careful. Objects created with group_by “remember” their groups until you change the groups or use the function ungroup from dplyr.

Look at how the output of a grouped data object tells you the number of groups in the output (see beneath this code chunk).

penguins %>%
  dplyr::group_by(species, island) %>%
  dplyr::summarize(penguins_count = dplyr::n())
# A tibble: 5 × 3
# Groups:   species [3]
  species   island    penguins_count
  <fct>     <fct>              <int>
1 Adelie    Biscoe                44
2 Adelie    Dream                 56
3 Adelie    Torgersen             52
4 Chinstrap Dream                 68
5 Gentoo    Biscoe               124

This means that all future uses of that pipe will continue to use the grouping established to create the “penguins_count” column. We can stop this by doing the same pipe, but adding ungroup after we’re done using the grouping established by group_by.

penguins %>%
  dplyr::group_by(species, island) %>%
  dplyr::summarize(penguins_count = dplyr::n()) %>%
  dplyr::ungroup()
# A tibble: 5 × 3
  species   island    penguins_count
  <fct>     <fct>              <int>
1 Adelie    Biscoe                44
2 Adelie    Dream                 56
3 Adelie    Torgersen             52
4 Chinstrap Dream                 68
5 Gentoo    Biscoe               124

See? We calculated with our desired groups but then dropped the grouping structure once we were finished with them. Note also that if you use group_by and do some calculation then re-group by something else by using group_by again, the second use of group_by will not be affected by the first. This means that you only need one ungroup per pipe.