Data Frames and how to use them Part 1 - R Basics
In this article, we will learn how to navigate, manipulate, import and export data frames, one of the most important data structures within R. Part 1
Hello and welcome back to my blog. In this post, we are going to be learning about Data Frames and how to use them. Data frames are essentially the main data structure you will be using 95% (just my estimate) of the time, so they are very important. Time to stay focused and pay full attention!
First of all, what is a data frame? For an easy explanation, the closest thing I can compare it to is an Excel spreadsheet. Then you might be wondering, what makes data frames different from a matrix? They both have rows and columns...
That is a good question, and one I actually asked myself before starting to write this post. As we learnt previously, matrices are 2-dimensional vectors, and as I mentioned in my vector post, a vector can only contain 1 type of data (i.e. numerics, logicals or characters). A data frame, on the other hand, can hold multiple data types: each column still has a single type, but different columns can have different types.
Let's look at an example in RStudio. Go ahead and type mtcars into the console. You should see a data frame appear in the console. mtcars is a built-in data frame that ships with R itself (it lives in the datasets package, so it is available even without RStudio) and there's actually a bunch of them. You can see the whole list by typing data() into the console; in RStudio it should open in the top-left window. The list contains the name of each data set and a brief description of what it is about.
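To make the matrix vs data frame difference concrete, here is a small sketch. The names m and d and their values are made up for illustration, and I pass stringsAsFactors = FALSE so the text column stays a character column on older R versions too.

```r
# A matrix is one vector with dimensions, so mixed values get coerced to a single type
m <- cbind(name = c("a", "b"), score = c(1.5, 2.5))
matrix_type <- typeof(m)   # every cell, including the scores, is now a character string

# A data frame keeps each column as its own vector with its own type
d <- data.frame(name = c("a", "b"), score = c(1.5, 2.5), stringsAsFactors = FALSE)
column_types <- sapply(d, class)

print(matrix_type)
print(column_types)
```

The scores in the matrix can no longer be used for arithmetic, while d$score is still numeric.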
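If you prefer to check this from code rather than from the RStudio window, something along these lines should work. It assumes the standard datasets package is attached, which is the default in a normal R session; dataset_index is just a name I picked.

```r
# Confirm that mtcars really is a data frame
print(is.data.frame(mtcars))

# data() returns an object whose $results matrix lists every data set in a package
dataset_index <- data(package = "datasets")$results

# Show the name and description of the first few built-in data sets
print(head(dataset_index[, c("Item", "Title")]))
```

Typing ?mtcars in the console opens the help page that explains what each column means.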
Navigating data frames
Anyways, let's go back to our mtcars example. Here are some useful functions for data frame navigation. Since some data frames can have thousands or even millions of observations (rows), it's often not necessary to see ALL of it ALL the time. In that case, we can use head(dataframe) or tail(dataframe) to see the first or last 6 observations within a data frame.
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
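Six rows is only the default. Both functions take an n argument, so here is a quick sketch of asking for a different number of rows (the variable names are my own).

```r
# First 3 rows instead of the default 6
first_three <- head(mtcars, n = 3)

# Last 2 rows
last_two <- tail(mtcars, n = 2)

# A negative n means "everything except": here, all rows but the last 2
all_but_last_two <- head(mtcars, n = -2)

print(first_three)
print(last_two)
print(nrow(all_but_last_two))
```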
Remember when I mentioned having multiple data types within a data frame? You can see the data type of each column in a data frame using the str(dataframe) function.
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
First off, you see that the data frame has 32 observations (rows) and 11 variables (columns). After that, we see every variable (column) along with its data type and the first few observations for it. You can also get a statistical summary of a data frame using the summary(dataframe) function.
> summary(mtcars)
      mpg             cyl            disp             hp             drat             wt             qsec              vs               am
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325   Median :17.71   Median :0.0000   Median :0.0000
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000
     gear            carb
 Min.   :3.000   Min.   :1.000
 1st Qu.:3.000   1st Qu.:2.000
 Median :4.000   Median :2.000
 Mean   :3.688   Mean   :2.812
 3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :5.000   Max.   :8.000
In the summary, we can see the minimum, 1st & 3rd quartile, median, mean and maximum values for each variable. This is helpful in certain scenarios to detect anomalies within a set of data.
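The same information is available from code, which helps when the printed summary gets too wide to read. This is only a sketch; mpg_summary and column_types are names I made up, and it uses the $ and [[ ]] notation for picking out a column and a named value.

```r
# Summary of a single column instead of the whole data frame
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Individual statistics can be picked out by name
mpg_median <- mpg_summary[["Median"]]
print(mpg_median)

# The column types that str() prints, as a named vector
column_types <- sapply(mtcars, class)
print(column_types)

# Rows and columns in one call
print(dim(mtcars))
```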
Creating a data frame
We can also create our own data frame by combining multiple vectors together using data.frame(). Let's do an example data frame of the weather in the past week!
#Create 4 vectors
> days <- c('Mon','Tue','Wed','Thu','Fri')
> temp <- c(25.4,23.2,24.1,24.0,23.2)
> rain <- c(F,T,F,T,T)
> cloudy <- c(F,T,T,T,T)
#Combining them into a data frame and assigning it to a variable
> weather <- data.frame(days,temp,rain,cloudy)
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu 24.0  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE
You can use str() and summary() to get a brief overview of our data frame. Go ahead and give it a try!
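You don't have to create the vectors first. As a sketch, data.frame() also accepts name = values pairs directly. The name weather_example is mine, and stringsAsFactors = FALSE is only there so that older R versions keep the days as plain text.

```r
# Build the data frame in one call, naming each column as we go
weather_example <- data.frame(
  days = c("Mon", "Tue", "Wed"),
  temp = c(25.4, 23.2, 24.1),
  rain = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# One type per column: character, numeric, logical
example_types <- sapply(weather_example, class)
print(example_types)
str(weather_example)
```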
Importing and Exporting data frames
Now... most of the time, we won't be creating our own data frame, but instead we will be reading it from an Excel file or a CSV file.
#Import data frame from csv file
> df <- read.csv('*filename*.csv')
#Import data frame from Excel file (.xls or .xlsx)
#Load the readxl package (run install.packages('readxl') once beforehand if you don't have it)
> library(readxl)
> df <- read_excel('*filename*.xlsx',sheet='*sheetname*')
#Output data frame into csv file
#It will save to your working directory by default
> write.csv(df, file='*filename*.csv')
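Here is a small round trip you can run without having a file of your own. It is a sketch that writes to a temporary file, so csv_path, weather_out and weather_in are illustrative names only. One thing worth knowing is that write.csv() also writes the row names as an extra first column unless you pass row.names = FALSE.

```r
# A throwaway file path so we don't clutter the working directory
csv_path <- tempfile(fileext = ".csv")

weather_out <- data.frame(
  days = c("Mon", "Tue"),
  temp = c(25.4, 23.2),
  rain = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# row.names = FALSE stops write.csv() from adding a column of row numbers
write.csv(weather_out, file = csv_path, row.names = FALSE)

# Read it back in; read.csv() works out the column types from the file contents
weather_in <- read.csv(csv_path, stringsAsFactors = FALSE)
str(weather_in)

# Clean up the temporary file
unlink(csv_path)
```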
You can get the number of rows and columns within a data frame by using the nrow(df) and ncol(df) functions. This will be useful if you need to loop through all the rows or columns of a data frame. We will be touching on loops in the near future.
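As a small preview of that, here is a sketch that uses ncol() to visit every column of mtcars and store its mean. Don't worry about the for syntax yet; column_means is just a name I chose.

```r
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

# Pre-allocate a named numeric vector, one slot per column
column_means <- numeric(n_cols)
names(column_means) <- names(mtcars)

# seq_len(n_cols) gives 1, 2, ..., n_cols
for (j in seq_len(n_cols)) {
  column_means[j] <- mean(mtcars[[j]])
}

print(n_rows)
print(n_cols)
print(round(column_means, 2))
```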
Adding Rows and Columns to data frames
To add one or more new rows to an existing data frame, use rbind(dataframe1,dataframe2). Please take note that the new rows should themselves be in a data frame (I'll explain why later), and the column names of both data frames must be the same, otherwise R will throw an error. Let's look back at our weather data frame.
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu 24.0  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE
#First we create a new data frame for our new data
> weather.new <- data.frame('Sat',24.7,T,T)
#Then we give column names to our new data frame
> names(weather.new) <- c('days', 'temp', 'rain', 'cloudy')
> weather.new
  days temp rain cloudy
1  Sat 24.7 TRUE   TRUE
#Now that our new data is a data frame, we can merge them
> weather <- rbind(weather,weather.new)
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu 24.0  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE
6  Sat 24.7  TRUE   TRUE
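The create-then-rename steps can also be done in one go by naming the columns inside data.frame(). The sketch below uses a shortened 2-row weather table and hypothetical names (weather_small, saturday, misnamed), and it also shows the error you get when the column names don't match.

```r
weather_small <- data.frame(
  days = c("Mon", "Tue"),
  temp = c(25.4, 23.2),
  rain = c(FALSE, TRUE),
  cloudy = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Name the columns while creating the one-row data frame
saturday <- data.frame(days = "Sat", temp = 24.7, rain = TRUE, cloudy = TRUE,
                       stringsAsFactors = FALSE)
weather_extended <- rbind(weather_small, saturday)
print(weather_extended)

# A column called "day" instead of "days" makes rbind() stop with an error;
# tryCatch() captures the message so the script keeps running
misnamed <- data.frame(day = "Sun", temp = 25.8, rain = FALSE, cloudy = FALSE,
                       stringsAsFactors = FALSE)
rbind_error <- tryCatch(
  rbind(weather_small, misnamed),
  error = function(e) conditionMessage(e)
)
print(rbind_error)
```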
Now, the reason we shouldn't use a vector to add a row to a data frame is that, if you remember, vectors can only contain one data type. For example, say we had Sunday data ('Sun',25.8,F,F): there are character, numeric and Boolean values in it. If you put them into a vector, all of the values are coerced to characters. R won't stop you from merging that vector with rbind(), but look at what happens... (don't run this if you are following along)
> sunday <- c('Sun',25.8,F,F)
> weather <- rbind(weather,sunday)
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu   24  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE
6  Sat 24.7  TRUE   TRUE
7  Sun 25.8 FALSE  FALSE
#Everything looks normal right??? Until you see the structure...
> str(weather)
'data.frame': 7 obs. of 4 variables:
$ days : chr "Mon" "Tue" "Wed" "Thu" ...
$ temp : chr "25.4" "23.2" "24.1" "24" ...
$ rain : chr "FALSE" "TRUE" "FALSE" "TRUE" ...
$ cloudy: chr "FALSE" "TRUE" "TRUE" "TRUE" ...
All the columns within the data frame have turned into the character data type so you can't perform any arithmetic operations on them.
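If this has already happened to you, the columns can be converted back. Here is a sketch on a small copy of the table (weather_small and weather_broken are made-up names), using as.numeric() and as.logical() to repair the types.

```r
weather_small <- data.frame(
  days = c("Mon", "Tue"),
  temp = c(25.4, 23.2),
  rain = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Reproduce the problem: a character vector drags every column to character
weather_broken <- rbind(weather_small, c("Sun", 25.8, FALSE))
str(weather_broken)

# Repair: convert each column back to the type it should have
weather_broken$temp <- as.numeric(weather_broken$temp)
weather_broken$rain <- as.logical(weather_broken$rain)
str(weather_broken)

# Arithmetic works again
print(mean(weather_broken$temp))
```

It is still easier to avoid the problem in the first place by putting the new row in its own data frame before calling rbind().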
Now on the other hand, when adding a new column into a data frame, we CAN use a vector to do so because each variable (column) holds exactly 1 data type. Go ahead and try to add a humidity column (83,77,85,75,66,89) to the weather data frame (the 6-row version, without the Sunday row, so the vector has one value per row). It should look like this at the end of it.
> weather
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83
2  Tue 23.2  TRUE   TRUE       77
3  Wed 24.1 FALSE   TRUE       85
4  Thu 24.0  TRUE   TRUE       75
5  Fri 23.2  TRUE   TRUE       66
6  Sat 24.7  TRUE   TRUE       89
> str(weather)
'data.frame': 6 obs. of 5 variables:
$ days : chr "Mon" "Tue" "Wed" "Thu" ...
$ temp : num 25.4 23.2 24.1 24 23.2 24.7
$ rain : logi FALSE TRUE FALSE TRUE TRUE TRUE
$ cloudy : logi FALSE TRUE TRUE TRUE TRUE TRUE
$ humidity: num 83 77 85 75 66 89
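In case you got stuck, here is one possible way to do it. This is a sketch that rebuilds the 6-row weather table first so it runs on its own, and it shows two options: assigning with $ and using cbind(). The name weather_cbind is mine.

```r
weather <- data.frame(
  days = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  temp = c(25.4, 23.2, 24.1, 24.0, 23.2, 24.7),
  rain = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
  cloudy = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
humidity <- c(83, 77, 85, 75, 66, 89)

# Option 1: cbind() returns a new data frame with the extra column
weather_cbind <- cbind(weather, humidity = humidity)

# Option 2: assign the vector to a new column name with $
weather$humidity <- humidity

print(weather)
str(weather)
```

Either way, the vector needs one value per row, otherwise R will complain or recycle values.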
Alright, this is where I'm stopping for today. Obviously, there will be more on data frames and how to operate, navigate and manipulate them but it will make this post way too long if I shoved them all into one single post.
Stay tuned for part 2 of R Basics - Frame, Frame, Data Frame. Just a heads up, I will be working outstation for the remainder of the year so the next post should be coming out at the start of next year. I'll do my best and see whether I can squeeze the time to finish writing part 2 before the end of the year. See you next time!