Data Frames and How to use them Part 2 - R Basics
In this article, we will learn how to navigate, manipulate, sort & filter data frames, one of the most important data structures within R. Part 2
Hey hey, if you are seeing before 1st Jan 2022, it means I MADE IT! I found the time to finish up this section so that we don't end the year on a hanger. Now, without further ado, let's get this ball rollin' and say bye bye to 2021. 2022 will definitely be a better year (hope this ages well).
Referencing, Selection and Indexing
This part is basically the same as what you learnt in the matrix post, where indexing is done using square brackets, the row number and column number of your chosing [rownum,colnum].
You can also use the row or column names as an index so you don't have to manually count the rows or columns every time you need something. Let's use the mtcars data frame as our example.
#Assigning the mtcars data frame to a df variable
> df <- mtcars
#Let's look at the first few rows for reference
> head(df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#If we wanted the 'hp' column, using [,*colname*] will return a vector
> df[,'hp']
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
#or using the column number within our df data frame
> df[,4]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
#My preferred method - using the dollar sign ($) with the column name
> df$hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
If you only either the row or column name in your indexing, it will return as a data frame rather than a vector
> head(df['hp'])
hp
Mazda RX4 110
Mazda RX4 Wag 110
Datsun 710 93
Hornet 4 Drive 110
Hornet Sportabout 175
Valiant 105
> head(df[1])
hp
Mazda RX4 110
Mazda RX4 Wag 110
Datsun 710 93
Hornet 4 Drive 110
Hornet Sportabout 175
Valiant 105
Interesting fact! (for me anyways) You can use negative numbers in your indexing to exclude certain rows or columns within a data frame.
Filtering your data frame
To filter observations (rows) that meet certain criteria within R, we use the subset() function. It's very simple, all you need to do is put in the data frame and the condition. Let's use our weather data frame from the previous post and try to get the days where it DID NOT rain. I'll post the code here for anyone that didn't save or close R after the previous post.
days <- c('Mon','Tue','Wed','Thu','Fri','Sat')
temp <- c(25.4,23.2,24.1,24.0,23.2,24.0)
rain <- c(F,T,F,T,T,T)
cloudy <- c(F,T,T,T,T,T)
humidity <- c(83,77,85,75,66,89)
weather <- data.frame(days,temp,rain,cloudy,humidity)
> weather
days temp rain cloudy humidity
1 Mon 25.4 FALSE FALSE 83
2 Tue 23.2 TRUE TRUE 77
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
5 Fri 23.2 TRUE TRUE 66
6 Sat 24.0 TRUE TRUE 89
#To filter the non-rainy days
> sunny <- subset(weather, rain == F)
> sunny
days temp rain cloudy humidity
1 Mon 25.4 FALSE FALSE 83
3 Wed 24.1 FALSE TRUE 85
#To have multiple criteria (non-rainy & non-cloudy) we use the **&** symbol
> super.sunny <- subset(weather, rain == F & cloudy == F)
> super.sunny
days temp rain cloudy humidity
1 Mon 25.4 FALSE FALSE 83
#To do OR criterias (meaning either one has to be satisfied
#We use the **|** symbol (above your enter, below your backspace)
> grey.days <- subset(weather, rain == T | cloudy == T)
> grey.days
days temp rain cloudy humidity
2 Tue 23.2 TRUE TRUE 77
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
5 Fri 23.2 TRUE TRUE 66
6 Sat 24.0 TRUE TRUE 89
And that's all there is to it!
Setting / Changing Column Names
In my previous post, we talked about how to add rows and columns into an existing data frame using rbind() and cbind(). But what if we want to change the column names in our data frame? For this, we can use the colnames() function.
Let's say I want to change 'temp' to 'temperature' because some people might mistake temp as temporary. To do this, we use colnames then put in the column index and assign a new name for it.
> colnames(weather)[2] <- 'temperature'
> weather
days temperature rain cloudy humidity
1 Mon 25.4 FALSE FALSE 83
2 Tue 23.2 TRUE TRUE 77
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
5 Fri 23.2 TRUE TRUE 66
6 Sat 24.0 TRUE TRUE 89
This is some optional knowledge, skip to the next section if you aren't too interested in the technical details. If you noticed, the way we did the indexing is slightly different from what we usually do. Usually, we include the indexing within the function e.g. colnames(weather[2]). However, colnames() will actually return a vector, so what we are doing is telling R to take the 2nd element from the colnames() vector which is why the [2] is outside the function.
If you want to change all the column names at once, you can use a vector.
> colnames(weather) <- c('one','two','three','four','five')
> weather
one two three four five
1 Mon 25.4 FALSE FALSE 83
2 Tue 23.2 TRUE TRUE 77
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
5 Fri 23.2 TRUE TRUE 66
6 Sat 24.0 TRUE TRUE 89
If you do not have enough elements within your vector for all the columns, the remainder will show up as NA.
#Let's change back the column names
> colnames(weather) <- c('A','B','C')
> weather
A B C NA NA
1 Mon 25.4 FALSE FALSE 83
2 Tue 23.2 TRUE TRUE 77
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
5 Fri 23.2 TRUE TRUE 66
6 Sat 24.0 TRUE TRUE 89
Ordering a Data Frame
We can also sort the order of our data frame using the order() function. It's quite simple, you basically just put in the column that you want to order into the order() function and it will return a vector of indexes after the data has been sorted. The explanation might be a bit complicated, I'll show you what I mean. Looking back at our weather data frame, let's say we want to sort the data based on the temperature in ascending order.
#Let's change back the column names
> colnames(weather) <- c('days','temp','rain','cloudy','humidity')
> weather
days temp rain cloudy humidity
1 Mon 25.4 FALSE FALSE 83
2 Tue 23.2 TRUE TRUE 77
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
5 Fri 23.2 TRUE TRUE 66
6 Sat 24.0 TRUE TRUE 89
> sorted.temp <- order(weather$temp)
> sorted.temp
[1] 2 5 4 6 3 1
#Now we can use this vector to sort our data
> weather[sorted.temp,]
days temp rain cloudy humidity
2 Tue 23.2 TRUE TRUE 77
5 Fri 23.2 TRUE TRUE 66
4 Thu 24.0 TRUE TRUE 75
6 Sat 24.0 TRUE TRUE 89
3 Wed 24.1 FALSE TRUE 85
1 Mon 25.4 FALSE FALSE 83
#To do a descending sort, we just have to put a minus sign in front of our column
> desc.temp <- order(-weather$temp)
> desc.temp
[1] 1 3 4 6 2 5
> weather[desc.temp,]
days temp rain cloudy humidity
1 Mon 25.4 FALSE FALSE 83
3 Wed 24.1 FALSE TRUE 85
4 Thu 24.0 TRUE TRUE 75
6 Sat 24.0 TRUE TRUE 89
2 Tue 23.2 TRUE TRUE 77
5 Fri 23.2 TRUE TRUE 66
Dealing with missing data
Arguably one of the most important aspects of data preprocessing for any data analyst/scientist. In the real world, the data will never be as clean as we hope, there will always be missing data fields and it is up to us to do whatever we can with the data we have. Before going in deeper, there is a very useful function called is.XX that can let you check whether the data is in a certain structure or meets a certain criteria, some examples are:
- is.na (checks whether the data is a N/A value)
- is.data.frame (checks whether the data structure is a data frame)
- is.character (checks whether the data is a character data type)
There are many, many more is.XX functions in R. Feel free to explore them in your free time. However, the con of the is.XX function is that it will return a vector of Boolean values for all the data within your data set. Let's see an example of this:
> test.data <- c(1,2,3,4,NA,5,NA,7,8,9)
> is.na(test.data)
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
Now imagine if our data frame had more than 1000 lines... We wouldn't be able to see through all of them. To get a short and sweet answer, we can use the any() function to return a single value and get a single value on whether there are any missing values.
> any(is.na(test.data))
[1] TRUE
There are a few ways to tackle missing data so it is up to the severity of the missing data and your judgement on which method is best suited for the specific situation.
First of all, we can just remove the missing data from our data set and there are 2 ways to do it.
#Let's prepare a simple data frame as an example
> test.data <- c(1,2,3,4,NA,5,NA,7,8,9)
> abc <- c(33,55,45,86,NA,NA,NA,NA,99,97)
> test.data <- data.frame(test.data,abc)
> test.data
test.data abc
1 1 33
2 2 55
3 3 45
4 4 86
5 NA NA
6 5 NA
7 NA NA
8 7 NA
9 8 99
10 9 97
#Using the exclusion symbol '!' and indexing
#This function will exclude the observations that are NA within a specific column
> test.omit <- test.data[!is.na(test.data$test.data),]
> test.omit
test.data abc
1 1 33
2 2 55
3 3 45
4 4 86
6 5 NA
8 7 NA
9 8 99
10 9 97
#Using na.omit()
#This function will remove ALL observations that have ANY NA values within them
> test.omit <- na.omit(test.data)
> test.omit
test.data abc
1 1 33
2 2 55
3 3 45
4 4 86
9 8 99
10 9 97
But sometimes you might want to include the missing data within your calculation or analysis so you just want to replace the NA value with a 0 or a specific value. We can just combine the skills we learnt previously (is.na, indexing, assignment) and put them in action!
#This function will change all NA values within the data frame to 0
test.data[is.na(test.data)] <- 0
#This function will change all NA values within a specific column
test.data$abc[is.na(test.data$abc)] <- 0
Phew~~~ Finally we managed to get through data frames, there's actually a lot more to cover with data frames but I think you have enough knowledge to get started! After this, I have one final data structure to cover called Lists which you might use occasionally but it's still important to learn nonetheless. If you made it this far, give yourself a pat on the back. You deserve it! You made it this far and you should be proud of yourself. We are almost done with the basics and we will slowly move into the more exciting parts of R and programming. Plotting graphs and doing analysis comes afterwards. Thank you for reading all this way, feel free to subscribe to my newsletter for future posts. Happy new year!!!