Newbie's Guide to Dplyr & Tidyr - R Tidyverse

Photo by shawnanggg on Unsplash

Newbie's Guide to Dplyr & Tidyr - R Tidyverse

Dplyr & Tidyr - your best friends when it comes to data cleaning and data manipulation in R

·

17 min read

Hello again everyone! First off, let me just say Happy Chinese New Year! If you couldn't tell, I'm of Chinese ethnicity so the holidays are coming so, I thought it would be a good time for us to finalize a chapter in our learning journey with R. The chapter called "Data Manipulation". So far, I have talked about the basics of programming with R such as data structures, functions, data types and data manipulation with base R. But one of the reason why R is so strong in data manipulation is thanks to the Tidyverse.

The tidyverse is a collection of R packages specifically designed for Data Science and all the packages are compatible with one another, making it easy to do everything from data cleaning & manipulation to data visualisation.

Some packages that are included in the tidyverse are:

  • dplyr (data manipulation)
  • tidyr (tidying data)
  • readr (reading rectangular data sets)
  • purrr (for functions and vectors)
  • stringr (working with strings)
  • forcats (working with factors)
  • tibble (a re-imagining of the data frame)
  • ggplot2 (data visualization) <- the most important aspect of what makes R so powerful

There are also other minor packages included but now you know how complete the tidyverse is. For my series, I will only cover 3 packages which are dplyr, tidyr & ggplot2. For this specific post, I will only be covering Dplyr as there is A LOT of useful stuff within this package alone.

Installation packages

Let's start off by installing the tidyverse into RStudio. To install any package into Rstudio, just type out the syntax below and R will automatically retrieve the package from its online server and install for you.

install.packages('package_name')

So to install the tidyverse, just type the below code into the Rstudio console and press Enter. It might take a few minutes (depending on your internet and computer speed) so be patient. You will know when it's done when the arrow > appears in the console again.

install.packages('tidyverse')

Loading packages

Now that we downloaded and installed the tidyverse, we have to tell Rstudio to load the packages we need so that we can use its powerful functions. To do that, we type the below syntax:

library(package_name)

Notice that when we install packages, we put the package name as a string with the quotation marks but when we load the packages, there is no need for quotation marks.

Okay, for this step, let me clarify... we could technically load the whole tidyverse into Rstudio by typing library(tidyverse) but I would recommend that you only load in the packages that you actually need instead of the whole collection as it may use up more of your RAM which may also cause performance to slow down on your computer. For now, let's just load the dplyr package.

library(dplyr)

Great job so far!

Dplyr

Dplyr is the package specialized for data manipulation and it also provides a consistent and simple set of verbs and syntax to solve most of our data manipulation needs.

For this post, we will be covering the functions which will be used most of the time during our data science journey.

  • The pipe operator %>%
  • filter()
  • slice()
  • arrange()
  • select()
  • rename()
  • distinct()
  • mutate()
  • summarise() & group_by()
  • sample_n() & sample_frac()

Before that, let's install a data package as our example data. It's called nycflights13 and this data set contains flight data of all NYC flights in 2013.

install.packages('nycflights13')
library(nycflights13)
summary(flights)

You should see a summary of all the columns within the flights data set, I won't paste it here as it will take up too much space.

filter()

First off, let's start with the filter function. The reason why I switched it around is that it will be easier to teach the pipe operator after you know at least one or two functions first. As the name implies, the filter() function will filter the rows within a data set based on the criterias given to it, sounds simple enough right? Because it is! Let's use it in an example.

Using the flights data set, let's say I only want to get the flights that were flown in April. For this case, we will need to filter the 'month' column for the number 4 for April. The cool thing about the tidyverse is that the first argument of ALL of the functions in the tidyverse is the data set that you want to use, so to filter for April flights, we just type in the code below:

aprFlights <- filter(flights,month == 4)
head(aprFlights,3)

# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     4     1      454            500        -6      636            640        -4 US        1843 N566UW  EWR    CLT         84      529     5
2  2013     4     1      509            515        -6      743            814       -31 UA        1545 N76288  EWR    IAH        194     1400     5
3  2013     4     1      526            530        -4      812            827       -15 UA        1714 N76517  LGA    IAH        206     1416     5
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

It's as simple as that, you can also put as many conditions as you want in a single filter function, for example, I want to filter all April flights that departed from JFK under American Airlines (AA).

aprFlights <- filter(flights,month == 4, origin == 'JFK', carrier == 'AA')
head(aprFlights,3)
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     4     1      534            540        -6      833            850       -17 AA        1141 N5DSAA  JFK    MIA        152     1089     5
2  2013     4     1      624            629        -5     1025           1039       -14 AA         413 N3CPAA  JFK    SJU        209     1598     6
3  2013     4     1      708            715        -7     1027           1045       -18 AA         443 N5CRAA  JFK    MIA        160     1089     7
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

Here's what it would look like if we were to do it in base R.

aprFlights <- flights[flights$month == 4 & flights$origin == 'JFK' & flights$carrier == 'AA',]

slice()

The slice() function basically just allows us to slice rows of data.

slice(flights, 25:27)
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     1     1      607            607         0      858            915       -17 UA        1077 N53442  EWR    MIA        157     1085     6
2  2013     1     1      608            600         8      807            735        32 MQ        3768 N9EAMQ  EWR    ORD        139      719     6
3  2013     1     1      611            600        11      945            931        14 UA         303 N532UA  JFK    SFO        366     2586     6
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

arrange()

The arrange() functions, as the name suggests, allows us to arrange our data in ascending or descending order based on one or multiple columns.

head(arrange(flights, air_time, carrier),3)
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     1    16     1355           1315        40     1442           1411        31 EV        4368 N16911  EWR    BDL         20      116    13
2  2013     4    13      537            527        10      622            628        -6 EV        4631 N12167  EWR    BDL         20      116     5
3  2013     3     8     2026           1935        51     2131           2056        35 9E        3650 N8501F  JFK    PHL         21       94    19
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

select()

The select() function allows us to only call out certain columns as sometimes we may be working with super wide data sets with tens of columns and it makes it see what we want to see.

head(select(flights,carrier,month,arr_time),3)
# A tibble: 3 x 3
  carrier month arr_time
  <chr>   <int>    <int>
1 UA          1      830
2 UA          1      850
3 AA          1      923

The pipe operator %>%

Now, on to the most important aspect of dplyr which is the pipe operator (%>%). By using the pipe operator, it not only allows us to skip the first argument for every function (which is the data set), it also allows us to write our code in a flow which makes it a lot easier to read. Let me explain what I mean.

For example, let's say we want to see July flights, that depart from LGA, sorted by the air_time and flight number, and we only want to see the top 3 rows of the carrier and tailnum columns. If we were to code all this into a single line without using the pipe operator, our code would look something like this:

head(select(arrange(filter(flights, month == 7, origin == 'LGA'), air_time, flight), carrier, tailnum), 3)
# A tibble: 3 x 2
  carrier tailnum
  <chr>   <chr>  
1 US      N959UW 
2 US      N944UW 
3 US      N958UW

Gives you a headache just looking at it right? Using the base syntax, you essentially have to code it from the inside and slowly expand outwards. Now let's use the pipe operator and create a flow for the operations we want to perform.

# First we call the data set, then filter, then arrange, then select, then get the top 3 rows
flights %>% filter(month == 7, origin == 'LGA') %>% arrange(air_time, flight) %>%
  select(carrier, tailnum) %>% head(3)
# A tibble: 3 x 2
  carrier tailnum
  <chr>   <chr>  
1 US      N959UW 
2 US      N944UW 
3 US      N958UW

Let's go through the code together. Firstly, we ask R to take the flights data set, then we filter for July and LGA origin, then we arrange by air_time and flight number, then we only select the carrier and tail number and lastly, we only want the top 3 rows.

As you can see, it's much easier to read and you know what arguments are used for each function without having to count the brackets if you were to use the base R method.

PRO TIP: Use ctrl + shift + m as a shortcut to generate a pipe operator so you don't have to type it in manually.

rename()

The rename() function allows us to rename columns in a data set, however this renaming does not alter the original data set, it is only temporary unless you assign it to a variable.

aprFlights <- flights %>% filter(month == 4) %>% rename(company = carrier)
head(aprFlights,3)
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay company flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     4     1      454            500        -6      636            640        -4 US        1843 N566UW  EWR    CLT         84      529     5
2  2013     4     1      509            515        -6      743            814       -31 UA        1545 N76288  EWR    IAH        194     1400     5
3  2013     4     1      526            530        -4      812            827       -15 UA        1714 N76517  LGA    IAH        206     1416     5
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

As you can see, the courier column name has been changed to company instead. But if you call out the original flights data set, it will still be called carrier.

distinct()

The distinct() function allows us to extract a list of unique values within a column in a data set. It is usually used together with select() but you can still use distinct() without it.

flights %>% distinct(carrier)
# A tibble: 16 x 1
   carrier
   <chr>  
 1 UA     
 2 AA     
 3 B6     
 4 DL     
 5 EV     
 6 MQ     
 7 US     
 8 WN     
 9 VX     
10 FL     
11 AS     
12 9E     
13 F9     
14 HA     
15 YV     
16 OO

mutate()

The mutate() function allows us to create a new column or replace the values in an existing column and this is particularly useful when it comes to feature engineering (which means creating our own variables using the existing data in our data set).

flights %>% mutate(flight_delay = arr_delay - dep_delay) %>%
  select(dep_delay, arr_delay, flight_delay) %>% head(3)
# A tibble: 3 x 3
  dep_delay arr_delay flight_delay
      <dbl>     <dbl>        <dbl>
1         2        11            9
2         4        20           16
3         2        33           31

flights %>% mutate(year = 2022) %>% head(3)
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <dbl> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2022     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5
2  2022     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5
3  2022     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

summarise() & group_by()

The summarise() function will allow us to quickly collapse data frames into a single rows that aggregate results like sum() or mean(). Remember to use the na.rm function to remove any NA values from your calculation.

flights %>% summarise(avg_dep_delay = mean(dep_delay,na.rm = T))
# A tibble: 1 x 1
  avg_dep_delay
          <dbl>
1          12.6

In order to generate aggregate values for every unique value, we can use the group_by() function to essentially tell R to group a variable first then find the sum / average for each one.

flights %>% group_by(carrier) %>% summarise(avg_dep_delay = mean(dep_delay,na.rm = T))
# A tibble: 16 x 2
   carrier avg_dep_delay
   <chr>           <dbl>
 1 9E              16.7 
 2 AA               8.59
 3 AS               5.80
 4 B6              13.0 
 5 DL               9.26
 6 EV              20.0 
 7 F9              20.2 
 8 FL              18.7 
 9 HA               4.90
10 MQ              10.6 
11 OO              12.6 
12 UA              12.1 
13 US               3.78
14 VX              12.9 
15 WN              17.7 
16 YV              19.0

sample_n() & sample_frac()

And lastly, we have sample_n() and sample_frac(). These two functions will take random samples from our data set based on the conditions given. If you didn't figure it out from the names, sample_n() will take a number of samples while sample_frac() will take a fraction of samples from our data set.

# 5 random samples
sample_n(flights,5)
# A tibble: 5 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     7    18      956           1000        -4     1233           1307       -34 UA         642 N557UA  JFK    SFO        313     2586    10
2  2013     8     4     1715           1716        -1     2054           2035        19 UA        1284 N37468  EWR    SFO        362     2565    17
3  2013    11    16     2241           2250        -9     2353              8       -15 B6        2002 N265JB  JFK    BUF         57      301    22
4  2013     8    10     1548           1543         5     1757           1809       -12 UA        1107 N81449  EWR    DEN        218     1605    15
5  2013     2    14     1654           1659        -5     2037           2046        -9 9E        3375 N928XJ  JFK    SAT        244     1587    16
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

#0.001% of samples
sample_frac(flights,0.00001)
# A tibble: 3 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
1  2013     5    23     1804           1730        34     2120           1935       105 DL        2331 N351NW  LGA    DTW         85      502    17
2  2013     3    29      720            725        -5     1004           1030       -26 AS          21 N524AS  EWR    SEA        329     2402     7
3  2013     3    27      818            810         8     1043           1030        13 FL         346 N974AT  LGA    ATL        115      762     8
# ... with 2 more variables: minute <dbl>, time_hour <dttm>

And that is all for the dplyr package! Good job for making it this far everyone! We just have one final hurdle to go through and that is the tidyr package.

Tidyr

Tidyr is the package specialized in helping us create tidy data sets. Now you may ask, what is a tidy data set? Tidy data is when we have a data set where every row is an observation and every column is a variable, this way the data is organized in such a way where every cell is a value for a specific variable of a specific observation. First, let's load the tidyr package into Rstudio.

library(tidyr)

Here are the the functions within the tidyr package:

  • gather()
  • spread()
  • separate()
  • unite()

Let's create our example data set where we have 3 different companies and their revenue for every quarter of the year.

comp <- c(1,1,1,2,2,2,3,3,3)
yr <- c(1998,1999,2000,1998,1999,2000,1998,1999,2000)
q1 <- runif(9, min=0, max=100)
q2 <- runif(9, min=0, max=100)
q3 <- runif(9, min=0, max=100)
q4 <- runif(9, min=0, max=100)

df <- data.frame(comp=comp,year=yr,Qtr1 = q1,Qtr2 = q2,Qtr3 = q3,Qtr4 = q4)
df
  comp year      Qtr1     Qtr2     Qtr3       Qtr4
1    1 1998 41.587617 18.58697 35.86367 61.8031462
2    1 1999 47.498490 57.31370 62.68025  0.2821631
3    1 2000 13.168443 79.51198 68.59027 10.7751459
4    2 1998  5.367519 93.74990 32.95184 62.3821434
5    2 1999  7.877658 77.81931 82.72293 46.6115526
6    2 2000 10.884134 38.62998 83.77137 31.0145831
7    3 1998  7.898336 63.38837 98.83824 26.7077957
8    3 1999 99.241663 95.80608 28.95033 78.6772148
9    3 2000 50.199691 68.39942 10.13243 19.2787576

With a data set like this, it would be complicated to get the sum or mean of all the companies because we would need to add up 4 different columns. So, we need to collapse the 4 revenue columns into a single revenue column.

gather()

This is where gather comes in. gather() will collapse multiple columns into key-pair values which in this case would be our variables Qtr1 to Qtr4.

df <- df %>% gather(Quarter,Revenue,Qtr1:Qtr4)
head(df,12)
   comp year Quarter   Revenue
1     1 1998    Qtr1 41.587617
2     1 1999    Qtr1 47.498490
3     1 2000    Qtr1 13.168443
4     2 1998    Qtr1  5.367519
5     2 1999    Qtr1  7.877658
6     2 2000    Qtr1 10.884134
7     3 1998    Qtr1  7.898336
8     3 1999    Qtr1 99.241663
9     3 2000    Qtr1 50.199691
10    1 1998    Qtr2 18.586970
11    1 1999    Qtr2 57.313699
12    1 2000    Qtr2 79.511981

So let's break down the syntax for this. The arguments for gather are (dataframe, name of column for the title of the variables to collapse, name of column for the values of the variables to collapse, which columns to collapse).

spread()

Now we have spread() and this is basically the opposite of what gather() does. So instead of collapsing multiple variables, it will spread a single variable into multiple columns based on the unique values of the variable.

df <- df %>% spread(Quarter, Revenue)
df
  comp year      Qtr1     Qtr2     Qtr3       Qtr4
1    1 1998 41.587617 18.58697 35.86367 61.8031462
2    1 1999 47.498490 57.31370 62.68025  0.2821631
3    1 2000 13.168443 79.51198 68.59027 10.7751459
4    2 1998  5.367519 93.74990 32.95184 62.3821434
5    2 1999  7.877658 77.81931 82.72293 46.6115526
6    2 2000 10.884134 38.62998 83.77137 31.0145831
7    3 1998  7.898336 63.38837 98.83824 26.7077957
8    3 1999 99.241663 95.80608 28.95033 78.6772148
9    3 2000 50.199691 68.39942 10.13243 19.2787576

##separate() The separate() function will turn a single character column into multiple columns based on a seperator.

df <- data.frame(x = c(NA, "a.x", "b.y", "c.z"))
df
     x
1 <NA>
2  a.x
3  b.y
4  c.z

df <- df %>% separate(x, c("ABC", "XYZ"))
df
   ABC  XYZ
1 <NA> <NA>
2    a    x
3    b    y
4    c    z

By default, R will automatically look for any non-alphanumeric values as the separator. You can let R know which separator to use by adding a 'sep' argument into it. Also, if the number of columns given to R is less than the separation performed, the remaining values will be discarded automatically. For example...

df <- data.frame(x = c(NA, "a.x.c", "b.y.k", "c.z.g"))
df
      x
1  <NA>
2 a.x.c
3 b.y.k
4 c.z.g

df <- df %>% separate(x, c("ABC", "XYZ"), sep = '.')
Warning message:
Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].
df
   ABC  XYZ
1 <NA> <NA>
2    a    x
3    b    y
4    c    z

unite()

And lastly, we have unite() which is the opposite of separate(). It combines multiple columns together into one and uses a separator (optional) in between.

df
   ABC  XYZ
1 <NA> <NA>
2    a    x
3    b    y
4    c    z

df <- df %>% unite("AtoZ", c("ABC", "XYZ"), sep = ";")
df
   AtoZ
1 NA;NA
2   a;x
3   b;y
4   c;z

Cheat Sheets

Sometimes it may be difficult to remember all this or sometimes there might be functions that you may not know that exists in the dplyr and tidyr package. Don't worry as there are cheat sheets available for that can let you easily refer to these functions.

And that is all for dplyr and tidyr! Woohoo! We are FINALLY done with the data cleaning, data manipulation & data wrangling part of our learning journey. However, please take your time to familiarize yourself with these concepts as you will be spending around 70% to 85% of your time actually doing this rather than the data visualization or machine learning aspect.

Great job making it this far! I am so proud of you and you should be proud of yourself too. Before we move onto the data visualization, I will go off on a tangent and talk about Git and Github which are a essential skill for ALL coders / programmers / developers / data scientists. I know... I've been dragging data visualization for a while but there's just too many things I think we need before we can move on and trust me, learning Git & Github will be worth it. It will only take a day or two for you to learn it, it's that simple!

Anyways, I will be off from posting new posts at the start of February because I will be spending time with my family during Chinese New Year. However, I'll see if I have the time to squeeze out the Git & Github post before that. But until then, see you guys next time!