Photo by shawnanggg on Unsplash
Newbie's Guide to Dplyr & Tidyr - R Tidyverse
Dplyr & Tidyr - your best friends when it comes to data cleaning and data manipulation in R
Hello again everyone! First off, let me just say Happy Chinese New Year! If you couldn't tell, I'm of Chinese ethnicity so the holidays are coming so, I thought it would be a good time for us to finalize a chapter in our learning journey with R. The chapter called "Data Manipulation". So far, I have talked about the basics of programming with R such as data structures, functions, data types and data manipulation with base R. But one of the reason why R is so strong in data manipulation is thanks to the Tidyverse.
The tidyverse is a collection of R packages specifically designed for Data Science and all the packages are compatible with one another, making it easy to do everything from data cleaning & manipulation to data visualisation.
Some packages that are included in the tidyverse are:
- dplyr (data manipulation)
- tidyr (tidying data)
- readr (reading rectangular data sets)
- purrr (for functions and vectors)
- stringr (working with strings)
- forcats (working with factors)
- tibble (a re-imagining of the data frame)
- ggplot2 (data visualization) <- the most important aspect of what makes R so powerful
There are also other minor packages included but now you know how complete the tidyverse is. For my series, I will only cover 3 packages which are dplyr, tidyr & ggplot2. For this specific post, I will only be covering Dplyr as there is A LOT of useful stuff within this package alone.
Installation packages
Let's start off by installing the tidyverse into RStudio. To install any package into Rstudio, just type out the syntax below and R will automatically retrieve the package from its online server and install for you.
install.packages('package_name')
So to install the tidyverse, just type the below code into the Rstudio console and press Enter. It might take a few minutes (depending on your internet and computer speed) so be patient. You will know when it's done when the arrow > appears in the console again.
install.packages('tidyverse')
Loading packages
Now that we downloaded and installed the tidyverse, we have to tell Rstudio to load the packages we need so that we can use its powerful functions. To do that, we type the below syntax:
library(package_name)
Notice that when we install packages, we put the package name as a string with the quotation marks but when we load the packages, there is no need for quotation marks.
Okay, for this step, let me clarify... we could technically load the whole tidyverse into Rstudio by typing library(tidyverse) but I would recommend that you only load in the packages that you actually need instead of the whole collection as it may use up more of your RAM which may also cause performance to slow down on your computer. For now, let's just load the dplyr package.
library(dplyr)
Great job so far!
Dplyr
Dplyr is the package specialized for data manipulation and it also provides a consistent and simple set of verbs and syntax to solve most of our data manipulation needs.
For this post, we will be covering the functions which will be used most of the time during our data science journey.
- The pipe operator %>%
- filter()
- slice()
- arrange()
- select()
- rename()
- distinct()
- mutate()
- summarise() & group_by()
- sample_n() & sample_frac()
Before that, let's install a data package as our example data. It's called nycflights13 and this data set contains flight data of all NYC flights in 2013.
install.packages('nycflights13')
library(nycflights13)
summary(flights)
You should see a summary of all the columns within the flights data set, I won't paste it here as it will take up too much space.
filter()
First off, let's start with the filter function. The reason why I switched it around is that it will be easier to teach the pipe operator after you know at least one or two functions first. As the name implies, the filter() function will filter the rows within a data set based on the criterias given to it, sounds simple enough right? Because it is! Let's use it in an example.
Using the flights data set, let's say I only want to get the flights that were flown in April. For this case, we will need to filter the 'month' column for the number 4 for April. The cool thing about the tidyverse is that the first argument of ALL of the functions in the tidyverse is the data set that you want to use, so to filter for April flights, we just type in the code below:
aprFlights <- filter(flights,month == 4)
head(aprFlights,3)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 4 1 454 500 -6 636 640 -4 US 1843 N566UW EWR CLT 84 529 5
2 2013 4 1 509 515 -6 743 814 -31 UA 1545 N76288 EWR IAH 194 1400 5
3 2013 4 1 526 530 -4 812 827 -15 UA 1714 N76517 LGA IAH 206 1416 5
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
It's as simple as that, you can also put as many conditions as you want in a single filter function, for example, I want to filter all April flights that departed from JFK under American Airlines (AA).
aprFlights <- filter(flights,month == 4, origin == 'JFK', carrier == 'AA')
head(aprFlights,3)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 4 1 534 540 -6 833 850 -17 AA 1141 N5DSAA JFK MIA 152 1089 5
2 2013 4 1 624 629 -5 1025 1039 -14 AA 413 N3CPAA JFK SJU 209 1598 6
3 2013 4 1 708 715 -7 1027 1045 -18 AA 443 N5CRAA JFK MIA 160 1089 7
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
Here's what it would look like if we were to do it in base R.
aprFlights <- flights[flights$month == 4 & flights$origin == 'JFK' & flights$carrier == 'AA',]
slice()
The slice() function basically just allows us to slice rows of data.
slice(flights, 25:27)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 607 607 0 858 915 -17 UA 1077 N53442 EWR MIA 157 1085 6
2 2013 1 1 608 600 8 807 735 32 MQ 3768 N9EAMQ EWR ORD 139 719 6
3 2013 1 1 611 600 11 945 931 14 UA 303 N532UA JFK SFO 366 2586 6
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
arrange()
The arrange() functions, as the name suggests, allows us to arrange our data in ascending or descending order based on one or multiple columns.
head(arrange(flights, air_time, carrier),3)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 16 1355 1315 40 1442 1411 31 EV 4368 N16911 EWR BDL 20 116 13
2 2013 4 13 537 527 10 622 628 -6 EV 4631 N12167 EWR BDL 20 116 5
3 2013 3 8 2026 1935 51 2131 2056 35 9E 3650 N8501F JFK PHL 21 94 19
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
select()
The select() function allows us to only call out certain columns as sometimes we may be working with super wide data sets with tens of columns and it makes it see what we want to see.
head(select(flights,carrier,month,arr_time),3)
# A tibble: 3 x 3
carrier month arr_time
<chr> <int> <int>
1 UA 1 830
2 UA 1 850
3 AA 1 923
The pipe operator %>%
Now, on to the most important aspect of dplyr which is the pipe operator (%>%). By using the pipe operator, it not only allows us to skip the first argument for every function (which is the data set), it also allows us to write our code in a flow which makes it a lot easier to read. Let me explain what I mean.
For example, let's say we want to see July flights, that depart from LGA, sorted by the air_time and flight number, and we only want to see the top 3 rows of the carrier and tailnum columns. If we were to code all this into a single line without using the pipe operator, our code would look something like this:
head(select(arrange(filter(flights, month == 7, origin == 'LGA'), air_time, flight), carrier, tailnum), 3)
# A tibble: 3 x 2
carrier tailnum
<chr> <chr>
1 US N959UW
2 US N944UW
3 US N958UW
Gives you a headache just looking at it right? Using the base syntax, you essentially have to code it from the inside and slowly expand outwards. Now let's use the pipe operator and create a flow for the operations we want to perform.
# First we call the data set, then filter, then arrange, then select, then get the top 3 rows
flights %>% filter(month == 7, origin == 'LGA') %>% arrange(air_time, flight) %>%
select(carrier, tailnum) %>% head(3)
# A tibble: 3 x 2
carrier tailnum
<chr> <chr>
1 US N959UW
2 US N944UW
3 US N958UW
Let's go through the code together. Firstly, we ask R to take the flights data set, then we filter for July and LGA origin, then we arrange by air_time and flight number, then we only select the carrier and tail number and lastly, we only want the top 3 rows.
As you can see, it's much easier to read and you know what arguments are used for each function without having to count the brackets if you were to use the base R method.
PRO TIP: Use ctrl + shift + m as a shortcut to generate a pipe operator so you don't have to type it in manually.
rename()
The rename() function allows us to rename columns in a data set, however this renaming does not alter the original data set, it is only temporary unless you assign it to a variable.
aprFlights <- flights %>% filter(month == 4) %>% rename(company = carrier)
head(aprFlights,3)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay company flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 4 1 454 500 -6 636 640 -4 US 1843 N566UW EWR CLT 84 529 5
2 2013 4 1 509 515 -6 743 814 -31 UA 1545 N76288 EWR IAH 194 1400 5
3 2013 4 1 526 530 -4 812 827 -15 UA 1714 N76517 LGA IAH 206 1416 5
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
As you can see, the courier column name has been changed to company instead. But if you call out the original flights data set, it will still be called carrier.
distinct()
The distinct() function allows us to extract a list of unique values within a column in a data set. It is usually used together with select() but you can still use distinct() without it.
flights %>% distinct(carrier)
# A tibble: 16 x 1
carrier
<chr>
1 UA
2 AA
3 B6
4 DL
5 EV
6 MQ
7 US
8 WN
9 VX
10 FL
11 AS
12 9E
13 F9
14 HA
15 YV
16 OO
mutate()
The mutate() function allows us to create a new column or replace the values in an existing column and this is particularly useful when it comes to feature engineering (which means creating our own variables using the existing data in our data set).
flights %>% mutate(flight_delay = arr_delay - dep_delay) %>%
select(dep_delay, arr_delay, flight_delay) %>% head(3)
# A tibble: 3 x 3
dep_delay arr_delay flight_delay
<dbl> <dbl> <dbl>
1 2 11 9
2 4 20 16
3 2 33 31
flights %>% mutate(year = 2022) %>% head(3)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<dbl> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2022 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5
2 2022 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5
3 2022 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
summarise() & group_by()
The summarise() function will allow us to quickly collapse data frames into a single rows that aggregate results like sum() or mean(). Remember to use the na.rm function to remove any NA values from your calculation.
flights %>% summarise(avg_dep_delay = mean(dep_delay,na.rm = T))
# A tibble: 1 x 1
avg_dep_delay
<dbl>
1 12.6
In order to generate aggregate values for every unique value, we can use the group_by() function to essentially tell R to group a variable first then find the sum / average for each one.
flights %>% group_by(carrier) %>% summarise(avg_dep_delay = mean(dep_delay,na.rm = T))
# A tibble: 16 x 2
carrier avg_dep_delay
<chr> <dbl>
1 9E 16.7
2 AA 8.59
3 AS 5.80
4 B6 13.0
5 DL 9.26
6 EV 20.0
7 F9 20.2
8 FL 18.7
9 HA 4.90
10 MQ 10.6
11 OO 12.6
12 UA 12.1
13 US 3.78
14 VX 12.9
15 WN 17.7
16 YV 19.0
sample_n() & sample_frac()
And lastly, we have sample_n() and sample_frac(). These two functions will take random samples from our data set based on the conditions given. If you didn't figure it out from the names, sample_n() will take a number of samples while sample_frac() will take a fraction of samples from our data set.
# 5 random samples
sample_n(flights,5)
# A tibble: 5 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 7 18 956 1000 -4 1233 1307 -34 UA 642 N557UA JFK SFO 313 2586 10
2 2013 8 4 1715 1716 -1 2054 2035 19 UA 1284 N37468 EWR SFO 362 2565 17
3 2013 11 16 2241 2250 -9 2353 8 -15 B6 2002 N265JB JFK BUF 57 301 22
4 2013 8 10 1548 1543 5 1757 1809 -12 UA 1107 N81449 EWR DEN 218 1605 15
5 2013 2 14 1654 1659 -5 2037 2046 -9 9E 3375 N928XJ JFK SAT 244 1587 16
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
#0.001% of samples
sample_frac(flights,0.00001)
# A tibble: 3 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 5 23 1804 1730 34 2120 1935 105 DL 2331 N351NW LGA DTW 85 502 17
2 2013 3 29 720 725 -5 1004 1030 -26 AS 21 N524AS EWR SEA 329 2402 7
3 2013 3 27 818 810 8 1043 1030 13 FL 346 N974AT LGA ATL 115 762 8
# ... with 2 more variables: minute <dbl>, time_hour <dttm>
And that is all for the dplyr package! Good job for making it this far everyone! We just have one final hurdle to go through and that is the tidyr package.
Tidyr
Tidyr is the package specialized in helping us create tidy data sets. Now you may ask, what is a tidy data set? Tidy data is when we have a data set where every row is an observation and every column is a variable, this way the data is organized in such a way where every cell is a value for a specific variable of a specific observation. First, let's load the tidyr package into Rstudio.
library(tidyr)
Here are the the functions within the tidyr package:
- gather()
- spread()
- separate()
- unite()
Let's create our example data set where we have 3 different companies and their revenue for every quarter of the year.
comp <- c(1,1,1,2,2,2,3,3,3)
yr <- c(1998,1999,2000,1998,1999,2000,1998,1999,2000)
q1 <- runif(9, min=0, max=100)
q2 <- runif(9, min=0, max=100)
q3 <- runif(9, min=0, max=100)
q4 <- runif(9, min=0, max=100)
df <- data.frame(comp=comp,year=yr,Qtr1 = q1,Qtr2 = q2,Qtr3 = q3,Qtr4 = q4)
df
comp year Qtr1 Qtr2 Qtr3 Qtr4
1 1 1998 41.587617 18.58697 35.86367 61.8031462
2 1 1999 47.498490 57.31370 62.68025 0.2821631
3 1 2000 13.168443 79.51198 68.59027 10.7751459
4 2 1998 5.367519 93.74990 32.95184 62.3821434
5 2 1999 7.877658 77.81931 82.72293 46.6115526
6 2 2000 10.884134 38.62998 83.77137 31.0145831
7 3 1998 7.898336 63.38837 98.83824 26.7077957
8 3 1999 99.241663 95.80608 28.95033 78.6772148
9 3 2000 50.199691 68.39942 10.13243 19.2787576
With a data set like this, it would be complicated to get the sum or mean of all the companies because we would need to add up 4 different columns. So, we need to collapse the 4 revenue columns into a single revenue column.
gather()
This is where gather comes in. gather() will collapse multiple columns into key-pair values which in this case would be our variables Qtr1 to Qtr4.
df <- df %>% gather(Quarter,Revenue,Qtr1:Qtr4)
head(df,12)
comp year Quarter Revenue
1 1 1998 Qtr1 41.587617
2 1 1999 Qtr1 47.498490
3 1 2000 Qtr1 13.168443
4 2 1998 Qtr1 5.367519
5 2 1999 Qtr1 7.877658
6 2 2000 Qtr1 10.884134
7 3 1998 Qtr1 7.898336
8 3 1999 Qtr1 99.241663
9 3 2000 Qtr1 50.199691
10 1 1998 Qtr2 18.586970
11 1 1999 Qtr2 57.313699
12 1 2000 Qtr2 79.511981
So let's break down the syntax for this. The arguments for gather are (dataframe, name of column for the title of the variables to collapse, name of column for the values of the variables to collapse, which columns to collapse).
spread()
Now we have spread() and this is basically the opposite of what gather() does. So instead of collapsing multiple variables, it will spread a single variable into multiple columns based on the unique values of the variable.
df <- df %>% spread(Quarter, Revenue)
df
comp year Qtr1 Qtr2 Qtr3 Qtr4
1 1 1998 41.587617 18.58697 35.86367 61.8031462
2 1 1999 47.498490 57.31370 62.68025 0.2821631
3 1 2000 13.168443 79.51198 68.59027 10.7751459
4 2 1998 5.367519 93.74990 32.95184 62.3821434
5 2 1999 7.877658 77.81931 82.72293 46.6115526
6 2 2000 10.884134 38.62998 83.77137 31.0145831
7 3 1998 7.898336 63.38837 98.83824 26.7077957
8 3 1999 99.241663 95.80608 28.95033 78.6772148
9 3 2000 50.199691 68.39942 10.13243 19.2787576
##separate() The separate() function will turn a single character column into multiple columns based on a seperator.
df <- data.frame(x = c(NA, "a.x", "b.y", "c.z"))
df
x
1 <NA>
2 a.x
3 b.y
4 c.z
df <- df %>% separate(x, c("ABC", "XYZ"))
df
ABC XYZ
1 <NA> <NA>
2 a x
3 b y
4 c z
By default, R will automatically look for any non-alphanumeric values as the separator. You can let R know which separator to use by adding a 'sep' argument into it. Also, if the number of columns given to R is less than the separation performed, the remaining values will be discarded automatically. For example...
df <- data.frame(x = c(NA, "a.x.c", "b.y.k", "c.z.g"))
df
x
1 <NA>
2 a.x.c
3 b.y.k
4 c.z.g
df <- df %>% separate(x, c("ABC", "XYZ"), sep = '.')
Warning message:
Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].
df
ABC XYZ
1 <NA> <NA>
2 a x
3 b y
4 c z
unite()
And lastly, we have unite() which is the opposite of separate(). It combines multiple columns together into one and uses a separator (optional) in between.
df
ABC XYZ
1 <NA> <NA>
2 a x
3 b y
4 c z
df <- df %>% unite("AtoZ", c("ABC", "XYZ"), sep = ";")
df
AtoZ
1 NA;NA
2 a;x
3 b;y
4 c;z
Cheat Sheets
Sometimes it may be difficult to remember all this or sometimes there might be functions that you may not know that exists in the dplyr and tidyr package. Don't worry as there are cheat sheets available for that can let you easily refer to these functions.
And that is all for dplyr and tidyr! Woohoo! We are FINALLY done with the data cleaning, data manipulation & data wrangling part of our learning journey. However, please take your time to familiarize yourself with these concepts as you will be spending around 70% to 85% of your time actually doing this rather than the data visualization or machine learning aspect.
Great job making it this far! I am so proud of you and you should be proud of yourself too. Before we move onto the data visualization, I will go off on a tangent and talk about Git and Github which are a essential skill for ALL coders / programmers / developers / data scientists. I know... I've been dragging data visualization for a while but there's just too many things I think we need before we can move on and trust me, learning Git & Github will be worth it. It will only take a day or two for you to learn it, it's that simple!
Anyways, I will be off from posting new posts at the start of February because I will be spending time with my family during Chinese New Year. However, I'll see if I have the time to squeeze out the Git & Github post before that. But until then, see you guys next time!