Data Frames and How to use them Part 2

Hey hey, if you are seeing this before 1st Jan 2022, it means I MADE IT! I found the time to finish up this section so that we don't end the year on a cliffhanger. Now, without further ado, let's get this ball rollin' and say bye bye to 2021. 2022 will definitely be a better year (hope this ages well).

Referencing, Selection and Indexing

This part is basically the same as what you learnt in the matrix post, where indexing is done using square brackets with the row number and column number of your choosing: [rownum,colnum].

You can also use the row or column names as an index so you don't have to manually count the rows or columns every time you need something. Let's use the mtcars data frame as our example.

#Assigning the mtcars data frame to a df variable
> df <- mtcars
#Let's look at the first few rows for reference
> head(df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

#If we wanted the 'hp' column, using [,*colname*] will return a vector
> df[,'hp']
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

#or using the column number within our df data frame
> df[,4]
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

#My preferred method - using the dollar sign ($) with the column name
> df$hp
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109
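The same square-bracket pattern works for rows as well as columns. Here is a small sketch of that (the variable names my.cars, one.value and small are my own picks for illustration, they don't come from anywhere else in this post):

```r
# mtcars ships with R, so no packages are needed
my.cars <- mtcars

# A single cell: row picked by name, column picked by name
one.value <- my.cars['Mazda RX4', 'hp']

# The first three rows by number, two columns by name
small <- my.cars[1:3, c('mpg', 'hp')]

print(one.value)
print(small)
```

Mixing numbers and names like this is fine, so use whichever is easier to read for the job at hand.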

If you supply only a single index (a column name or column number, with no comma), R will return a data frame rather than a vector.

> head(df['hp'])
                     hp
Mazda RX4           110
Mazda RX4 Wag       110
Datsun 710           93
Hornet 4 Drive      110
Hornet Sportabout   175
Valiant             105

> head(df[4])
                     hp
Mazda RX4           110
Mazda RX4 Wag       110
Datsun 710           93
Hornet 4 Drive      110
Hornet Sportabout   175
Valiant             105

Interesting fact! (for me anyways) You can use negative numbers in your indexing to exclude certain rows or columns within a data frame.
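To make the negative indexing idea concrete, here is a minimal sketch using mtcars again (no.first.row and no.mpg.cyl are names I made up for this example). Note that this trick works with row or column numbers, not with names.

```r
df <- mtcars

# Drop the first row and keep everything else
no.first.row <- df[-1, ]

# Drop the first two columns (mpg and cyl)
no.mpg.cyl <- df[, -c(1, 2)]

print(dim(no.first.row))      # rows, then columns
print(colnames(no.mpg.cyl))
```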

Filtering your data frame

To filter observations (rows) that meet certain criteria in R, we can use the subset() function. It's very simple: all you need to do is pass in the data frame and the condition. Let's use our weather data frame from the previous post and try to get the days where it DID NOT rain. I'll post the code here for anyone who didn't save their workspace or has closed R since the previous post.

days <- c('Mon','Tue','Wed','Thu','Fri','Sat')
temp <- c(25.4,23.2,24.1,24.0,23.2,24.0)
rain <- c(F,T,F,T,T,T)
cloudy <- c(F,T,T,T,T,T)
humidity <- c(83,77,85,75,66,89)
weather <- data.frame(days,temp,rain,cloudy,humidity)

> weather
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83
2  Tue 23.2  TRUE   TRUE       77
3  Wed 24.1 FALSE   TRUE       85
4  Thu 24.0  TRUE   TRUE       75
5  Fri 23.2  TRUE   TRUE       66
6  Sat 24.0  TRUE   TRUE       89

#To filter the non-rainy days
> sunny <- subset(weather, rain == F)
> sunny
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83
3  Wed 24.1 FALSE   TRUE       85

#To have multiple criteria (non-rainy & non-cloudy) we use the **&** symbol
> super.sunny <- subset(weather, rain == F & cloudy == F)
> super.sunny
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83

#To do OR criteria (meaning only one of them has to be satisfied)
#we use the **|** symbol (above your enter, below your backspace)
> grey.days <- subset(weather, rain == T | cloudy == T)
> grey.days
  days temp  rain cloudy humidity
2  Tue 23.2  TRUE   TRUE       77
3  Wed 24.1 FALSE   TRUE       85
4  Thu 24.0  TRUE   TRUE       75
5  Fri 23.2  TRUE   TRUE       66
6  Sat 24.0  TRUE   TRUE       89

And that's all there is to it!
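If you are curious, the same filter can also be written with the square-bracket indexing from the first section, and subset() has a select argument for keeping only some columns. A rough sketch, where wx, dry and dry.small are names I invented so that your weather data frame is left alone:

```r
# A small copy of the weather data under a different (made-up) name
wx <- data.frame(
  days = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'),
  temp = c(25.4, 23.2, 24.1, 24.0, 23.2, 24.0),
  rain = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Same idea as subset(wx, rain == FALSE), written with plain indexing
dry <- wx[wx$rain == FALSE, ]

# subset() can also trim the columns through its select argument
dry.small <- subset(wx, rain == FALSE, select = c(days, temp))

print(dry)
print(dry.small)
```

I still find subset() easier to read, but the bracket version is handy to recognise when you see it in other people's code.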

Setting / Changing Column Names

In my previous post, we talked about how to add rows and columns into an existing data frame using rbind() and cbind(). But what if we want to change the column names in our data frame? For this, we can use the colnames() function.

Let's say I want to change 'temp' to 'temperature' because some people might mistake temp for temporary. To do this, we use colnames(), put the column index after it and assign the new name.

> colnames(weather)[2] <- 'temperature'

> weather 
  days temperature  rain cloudy humidity
1  Mon        25.4 FALSE  FALSE       83
2  Tue        23.2  TRUE   TRUE       77
3  Wed        24.1 FALSE   TRUE       85
4  Thu        24.0  TRUE   TRUE       75
5  Fri        23.2  TRUE   TRUE       66
6  Sat        24.0  TRUE   TRUE       89

This is some optional knowledge, so skip to the next section if you aren't too interested in the technical details. If you noticed, the indexing here looks slightly different from what we usually do. Usually, we put the index inside the function call, e.g. colnames(weather[2]). However, colnames(weather) returns a vector of names, so what we are doing is telling R to take the 2nd element of that vector and replace it, which is why the [2] sits outside the function.
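Here is a tiny sketch of that idea. It uses a two-column data frame called wx that I made up for the example, and it also shows a variation (my own suggestion) where the column is found by its old name, so you don't have to count positions:

```r
# A made-up two-column data frame, just for illustration
wx <- data.frame(days = c('Mon', 'Tue'), temp = c(25.4, 23.2))

# colnames() hands back an ordinary character vector
cols <- colnames(wx)
print(cols)       # "days" "temp"
print(cols[2])    # "temp"

# Rename by matching the old name instead of counting the position
colnames(wx)[colnames(wx) == 'temp'] <- 'temperature'
print(colnames(wx))
```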

If you want to change all the column names at once, you can use a vector.

> colnames(weather) <- c('one','two','three','four','five')
> weather
  one  two three  four five
1 Mon 25.4 FALSE FALSE   83
2 Tue 23.2  TRUE  TRUE   77
3 Wed 24.1 FALSE  TRUE   85
4 Thu 24.0  TRUE  TRUE   75
5 Fri 23.2  TRUE  TRUE   66
6 Sat 24.0  TRUE  TRUE   89

If you do not have enough elements within your vector for all the columns, the remainder will show up as NA.

#Let's supply only 3 names for our 5 columns
> colnames(weather) <- c('A','B','C')
> weather
    A    B     C    NA NA
1 Mon 25.4 FALSE FALSE 83
2 Tue 23.2  TRUE  TRUE 77
3 Wed 24.1 FALSE  TRUE 85
4 Thu 24.0  TRUE  TRUE 75
5 Fri 23.2  TRUE  TRUE 66
6 Sat 24.0  TRUE  TRUE 89
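Since NA column names are rarely what you want, one possible safeguard is to compare the number of names with the number of columns before renaming. This is only a sketch of that habit; wx and new.names are hypothetical names for the example:

```r
# A made-up three-column data frame
wx <- data.frame(days = c('Mon', 'Tue'),
                 temp = c(25.4, 23.2),
                 rain = c(FALSE, TRUE))

new.names <- c('A', 'B')   # one name short, on purpose

# Only rename when there is exactly one name per column
if (length(new.names) == ncol(wx)) {
  colnames(wx) <- new.names
} else {
  message('Expected ', ncol(wx), ' names but got ', length(new.names))
}

print(colnames(wx))   # unchanged, because the lengths did not match
```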

Ordering a Data Frame

We can also sort our data frame using the order() function. You pass in the column you want to sort by, and order() returns a vector of row indexes arranged in the order that would sort that column. That explanation might sound a bit complicated, so I'll show you what I mean. Looking back at our weather data frame, let's say we want to sort the data by temperature in ascending order.

#Let's change back the column names
> colnames(weather) <- c('days','temp','rain','cloudy','humidity')
> weather
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83
2  Tue 23.2  TRUE   TRUE       77
3  Wed 24.1 FALSE   TRUE       85
4  Thu 24.0  TRUE   TRUE       75
5  Fri 23.2  TRUE   TRUE       66
6  Sat 24.0  TRUE   TRUE       89

> sorted.temp <- order(weather$temp)
> sorted.temp
[1] 2 5 4 6 3 1

#Now we can use this vector to sort our data
> weather[sorted.temp,]
  days temp  rain cloudy humidity
2  Tue 23.2  TRUE   TRUE       77
5  Fri 23.2  TRUE   TRUE       66
4  Thu 24.0  TRUE   TRUE       75
6  Sat 24.0  TRUE   TRUE       89
3  Wed 24.1 FALSE   TRUE       85
1  Mon 25.4 FALSE  FALSE       83

#To do a descending sort on a numeric column, we just have to put a minus sign in front of it
> desc.temp <- order(-weather$temp)
> desc.temp
[1] 1 3 4 6 2 5
> weather[desc.temp,]
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83
3  Wed 24.1 FALSE   TRUE       85
4  Thu 24.0  TRUE   TRUE       75
6  Sat 24.0  TRUE   TRUE       89
2  Tue 23.2  TRUE   TRUE       77
5  Fri 23.2  TRUE   TRUE       66
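Two small extras that build on this: order() also accepts decreasing = TRUE, which is useful for character columns where a minus sign does not work, and it can take more than one column to break ties. A minimal sketch, using a trimmed copy of the weather data under the made-up name wx:

```r
# A trimmed copy of the weather data (wx is a made-up name)
wx <- data.frame(
  days     = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'),
  temp     = c(25.4, 23.2, 24.1, 24.0, 23.2, 24.0),
  humidity = c(83, 77, 85, 75, 66, 89)
)

# Descending sort without the minus sign
by.temp.desc <- wx[order(wx$temp, decreasing = TRUE), ]

# Sort by temp ascending, then break ties with humidity descending
tie.break <- order(wx$temp, -wx$humidity)

print(by.temp.desc)
print(wx[tie.break, ])
```

In the second result, Sat comes before Thu because both have a temp of 24.0 and Sat has the higher humidity.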

Dealing with missing data

Dealing with missing data is arguably one of the most important aspects of data preprocessing for any data analyst/scientist. In the real world, the data will never be as clean as we hope. There will always be missing fields, and it is up to us to do whatever we can with the data we have. Before going deeper, there is a very useful family of functions named is.XX that let you check whether the data has a certain structure or meets a certain criterion. Some examples are:

is.na (checks whether the data is a N/A value)
is.data.frame (checks whether the data structure is a data frame)
is.character (checks whether the data is a character data type)

There are many, many more is.XX functions in R. Feel free to explore them in your free time. However, the downside of is.na() is that it checks every element and returns a vector of Boolean values as long as your data. Let's see an example of this:

> test.data <- c(1,2,3,4,NA,5,NA,7,8,9)
> is.na(test.data)
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

Now imagine if our data had more than 1000 values... We wouldn't be able to read through all of them. To get a short and sweet answer, we can wrap it in the any() function, which returns a single TRUE or FALSE telling us whether there are any missing values at all.

> any(is.na(test.data))
[1] TRUE
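If you also want to know how many values are missing and where they are, sum() and which() combine nicely with is.na(). A quick sketch (values, na.count and na.position are names I picked for the example):

```r
values <- c(1, 2, 3, 4, NA, 5, NA, 7, 8, 9)

# TRUE counts as 1, so summing the Booleans counts the NAs
na.count <- sum(is.na(values))

# which() gives the positions of the TRUE values
na.position <- which(is.na(values))

print(na.count)       # 2
print(na.position)    # 5 7
```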

There are a few ways to tackle missing data, so which method is best depends on how severe the missing data is and on your own judgement of the specific situation.

First of all, we can just remove the missing data from our data set and there are 2 ways to do it.

#Let's prepare a simple data frame as an example
> test.data <- c(1,2,3,4,NA,5,NA,7,8,9)
> abc <- c(33,55,45,86,NA,NA,NA,NA,99,97)
> test.data <- data.frame(test.data,abc)
> test.data
   test.data abc
1          1  33
2          2  55
3          3  45
4          4  86
5         NA  NA
6          5  NA
7         NA  NA
8          7  NA
9          8  99
10         9  97

#Using the NOT operator '!' and indexing
#This keeps only the observations that are not NA within a specific column
> test.omit <- test.data[!is.na(test.data$test.data),]
> test.omit
   test.data abc
1          1  33
2          2  55
3          3  45
4          4  86
6          5  NA
8          7  NA
9          8  99
10         9  97

#Using na.omit()
#This function will remove ALL observations that have ANY NA values within them
> test.omit <- na.omit(test.data)
> test.omit
   test.data abc
1          1  33
2          2  55
3          3  45
4          4  86
9          8  99
10         9  97
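A related base R function is complete.cases(), which returns TRUE for every row that has no NA in it. It is not covered above, so treat this as an optional sketch; demo, keep and clean are made-up names:

```r
# The same example data under a made-up name
demo <- data.frame(
  test.data = c(1, 2, 3, 4, NA, 5, NA, 7, 8, 9),
  abc       = c(33, 55, 45, 86, NA, NA, NA, NA, 99, 97)
)

# TRUE for rows without any NA, FALSE otherwise
keep <- complete.cases(demo)

# Keep only the complete rows, similar in effect to na.omit()
clean <- demo[keep, ]

print(sum(!keep))   # how many rows would be dropped
print(clean)
```

Checking sum(!keep) first is a cheap way to see how much data you are about to throw away.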

But sometimes you might want to include the missing data within your calculation or analysis so you just want to replace the NA value with a 0 or a specific value. We can just combine the skills we learnt previously (is.na, indexing, assignment) and put them in action!

#This line will change all NA values within the data frame to 0
test.data[is.na(test.data)] <- 0

#This line will change all NA values within a specific column to 0
test.data$abc[is.na(test.data$abc)] <- 0
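The replacement doesn't have to be 0. As one illustration of "a specific value", the sketch below fills the gaps with the mean of the values that are present. Whether that is a sensible choice depends entirely on your data, so take it as an example of the pattern rather than a recommendation; demo and abc.mean are made-up names:

```r
# One column from the example data, under a made-up name
demo <- data.frame(abc = c(33, 55, 45, 86, NA, NA, NA, NA, 99, 97))

# na.rm = TRUE tells mean() to ignore the missing values
abc.mean <- mean(demo$abc, na.rm = TRUE)

# Same is.na + indexing + assignment pattern as before
demo$abc[is.na(demo$abc)] <- abc.mean

print(abc.mean)
print(demo)
```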

Phew~~~ We finally made it through data frames! There's actually a lot more to cover, but I think you have enough knowledge to get started. After this, I have one final data structure to cover called lists, which you might only use occasionally but is still important to learn. If you made it this far, give yourself a pat on the back. You deserve it, and you should be proud of yourself. We are almost done with the basics and will slowly move into the more exciting parts of R and programming: plotting graphs and doing analysis. Thank you for reading all this way, and feel free to subscribe to my newsletter for future posts. Happy new year!!!

