Data Frames and how to use them Part 1 - R Basics

In this article, we will learn how to navigate, manipulate, import and export data frames, one of the most important data structures in R. This is Part 1.

Hello and welcome back to my blog. In this post, we are going to be learning about Data Frames and how to use them. Data frames are essentially the main data structure you will be using 95% (just my estimate) of the time, so they are very important. Time to stay focused and pay full attention!

First of all, what is a data frame? The closest thing I can compare it to is an Excel spreadsheet. You might then be wondering what makes a data frame different from a matrix. They both have rows and columns...

That is a good question, and one I actually asked myself before starting to write this post. As we learnt previously, matrices are 2-dimensional vectors, and as I mentioned in my vector post, vectors can only contain 1 type of data (e.g. numerics, logicals or characters). A data frame, on the other hand, can hold multiple data types, with each column having its own type.
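
To make the difference concrete, here is a minimal sketch. The names m and d are just made up for this illustration. Putting text and numbers into a matrix forces everything into one type, while a data frame keeps a separate type for each column.

```r
# A matrix coerces mixed values to a single type (here, character)
m <- cbind(c("Mon", "Tue"), c(25.4, 23.2))
typeof(m)          # "character"

# A data frame keeps a separate type per column
d <- data.frame(day = c("Mon", "Tue"), temp = c(25.4, 23.2))
sapply(d, class)   # temp is "numeric"; day is "character" (a factor before R 4.0)
```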

Let's look at an example in RStudio. Go ahead and type mtcars into the console. You should see a data frame appear in the console. mtcars is a built-in data frame that ships with R itself (not just RStudio), and there's actually a bunch of them. You can see the whole list by typing data() into the console; in RStudio, the list should appear in the top-left window. It contains the name of each data set and a brief description of what it is about.

Navigating data frames

Anyways, let's go back to our mtcars example. Here are some useful functions for data frame navigation. Since some data frames can have thousands or even millions of observations (rows), it's often unnecessary to see ALL of it ALL the time. In that case, we can use head(dataframe) or tail(dataframe) to see the first or last 6 observations of a data frame.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
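
If 6 rows is not what you want, both functions accept an n argument. A small sketch (the variable names first_three and last_two are only for illustration):

```r
# mtcars ships with R, so there is nothing to import
first_three <- head(mtcars, n = 3)  # Mazda RX4, Mazda RX4 Wag, Datsun 710
last_two    <- tail(mtcars, n = 2)  # Maserati Bora, Volvo 142E

first_three
last_two
```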

Remember when I mentioned having multiple data types within a data frame? You can see the data type of each column in a data frame using the str(dataframe) function.

> str(mtcars)
'data.frame':    32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

First off, you can see that the data frame has 32 observations (rows) and 11 variables (columns). After that, we see every variable (column) along with its data type and the first few observations for it. You can also get a statistical summary of a data frame using the summary(dataframe) function.

> summary(mtcars)
      mpg             cyl             disp             hp             drat             wt             qsec             vs               am        
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325   Median :17.71   Median :0.0000   Median :0.0000  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
      gear            carb      
 Min.   :3.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000  
 Median :4.000   Median :2.000  
 Mean   :3.688   Mean   :2.812  
 3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :8.000

In the summary, we can see the minimum, 1st & 3rd quartile, median, mean and maximum values for each variable. This is helpful in certain scenarios to detect anomalies within a set of data.
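
As a rough sketch of how you might follow up on something that stands out in the summary, the snippet below checks the range of the hp column and finds which car has the largest value. It uses the $ column accessor, which this post has not covered, so treat it as a preview rather than something you need to know yet.

```r
# Smallest and largest horsepower, matching Min. and Max. in the summary
range(mtcars$hp)                        # 52 335

# Which car has the highest horsepower?
rownames(mtcars)[which.max(mtcars$hp)]  # "Maserati Bora"
```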

Creating a data frame

We can also create our own data frame by combining multiple vectors together using data.frame(). Let's do an example data frame of the weather in the past week!

#Create 4 vectors
> days <- c('Mon','Tue','Wed','Thu','Fri')
> temp <- c(25.4,23.2,24.1,24.0,23.2)
> rain <- c(F,T,F,T,T)
> cloudy <- c(F,T,T,T,T)

#Combining them into a data frame and assigning it to a variable
> weather <- data.frame(days,temp,rain,cloudy)
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu 24.0  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE

You can use str() and summary() to get a brief overview of our data frame. Go ahead and give it a try!
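
If you want something to compare against, here is one way to run it as a plain script (no console prompts). It rebuilds the same weather data frame first so the snippet works on its own.

```r
days   <- c("Mon", "Tue", "Wed", "Thu", "Fri")
temp   <- c(25.4, 23.2, 24.1, 24.0, 23.2)
rain   <- c(FALSE, TRUE, FALSE, TRUE, TRUE)
cloudy <- c(FALSE, TRUE, TRUE, TRUE, TRUE)
weather <- data.frame(days, temp, rain, cloudy)

str(weather)      # one line per column, showing its type
summary(weather)  # numeric columns get quartiles; logical columns get FALSE/TRUE counts
```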

Importing and Exporting data frames

Now... most of the time, we won't be creating our own data frame. Instead, we will be reading it from an Excel file or a CSV file.

#Import data frame from csv file
> df <- read.csv('*filename*.csv')

#Import data frame from excel file (.xls or .xlsx)
#Load the readxl package (install it first with install.packages('readxl') if you don't have it)
> library(readxl)
> df <- read_excel('*filename*.xlsx', sheet='*sheetname*')

#Output data frame into csv file
#It will save to your working directory by default
> write.csv(df, file='*filename*.csv')
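
To see the CSV round trip end to end without touching your working directory, here is a minimal sketch. It assumes a tiny made-up data frame and writes to a temporary file created with tempfile(). The row.names = FALSE argument is my own addition: without it, write.csv() also writes the row numbers as an extra first column.

```r
# Small example data frame
weather <- data.frame(
  days = c("Mon", "Tue"),
  temp = c(25.4, 23.2),
  rain = c(FALSE, TRUE)
)

# tempfile() gives a throwaway path, so nothing lands in the working directory
path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() from adding a column of row numbers
write.csv(weather, file = path, row.names = FALSE)

# Read the file back in and check that the structure survived
weather_back <- read.csv(path)
str(weather_back)
```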

You can get the number of rows and columns within a data frame by using the nrow(df) and ncol(df) functions. This will be useful if you need to perform a loop through all the rows or columns within a data frame. We will be touching on loops in the near future.
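
For example, on the built-in mtcars data frame (dim() is an extra function not mentioned above; it returns both numbers at once):

```r
nrow(mtcars)  # 32
ncol(mtcars)  # 11
dim(mtcars)   # 32 11 (rows first, then columns)
```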

Adding Rows and Columns to data frames

To add one or more new rows to an existing data frame, use rbind(dataframe1,dataframe2). Please take note that when adding rows, you should bind a data frame to another data frame (I'll explain why later), and the column names of both data frames must be the same, otherwise R will throw an error. Let's look back at our weather data frame.

> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu 24.0  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE

#First we create a new data frame for our new data
> weather.new <- data.frame('Sat',24.7,T,T)
#Then we give column names to our new data frame
> names(weather.new) <- c('days', 'temp', 'rain', 'cloudy')
> weather.new
  days temp rain cloudy
1  Sat 24.7 TRUE   TRUE

#Now that our new data is a data frame, we can merge them
> weather <- rbind(weather,weather.new)
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu 24.0  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE
6  Sat 24.7  TRUE   TRUE

Now, the reason we shouldn't bind a plain vector to a data frame is that, if you remember, vectors can only contain one data type. For example, suppose we had Sunday's data ('Sun',25.8,F,F): it contains character, numeric and Boolean values. If you put it into a vector, all of the values will be coerced to characters. So let's say we were to merge this... (don't run this if you are following along)

> sunday <- c('Sun',25.8,F,F)
> weather <- rbind(weather,sunday)
> weather
  days temp  rain cloudy
1  Mon 25.4 FALSE  FALSE
2  Tue 23.2  TRUE   TRUE
3  Wed 24.1 FALSE   TRUE
4  Thu   24  TRUE   TRUE
5  Fri 23.2  TRUE   TRUE
6  Sat 24.7  TRUE   TRUE
7  Sun 25.8 FALSE  FALSE

#Everything looks normal right??? Until you see the structure...
> str(weather)
'data.frame':    7 obs. of  4 variables:
 $ days  : chr  "Mon" "Tue" "Wed" "Thu" ...
 $ temp  : chr  "25.4" "23.2" "24.1" "24" ...
 $ rain  : chr  "FALSE" "TRUE" "FALSE" "TRUE" ...
 $ cloudy: chr  "FALSE" "TRUE" "TRUE" "TRUE" ...

All the columns within the data frame have turned into the character data type so you can't perform any arithmetic operations on them.
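
For contrast, here is a sketch of the safe version: Sunday's row is built as a one-row data frame, so each value keeps its own type. The snippet rebuilds the six-day weather data frame so it runs on its own, and it names the columns directly inside data.frame(), which is a shortcut for the separate names() step used earlier.

```r
weather <- data.frame(
  days   = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  temp   = c(25.4, 23.2, 24.1, 24.0, 23.2, 24.7),
  rain   = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
  cloudy = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# A one-row data frame with matching column names
sunday <- data.frame(days = "Sun", temp = 25.8, rain = FALSE, cloudy = FALSE)

weather <- rbind(weather, sunday)
sapply(weather, class)  # temp stays "numeric"; rain and cloudy stay "logical"
```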

Now on the other hand, when adding a new column to a data frame, we CAN use a vector to do so because each variable (column) holds a single data type. Go ahead and try to add a humidity column (83,77,85,75,66,89) to the weather data frame. It should look like this at the end of it.

> weather
  days temp  rain cloudy humidity
1  Mon 25.4 FALSE  FALSE       83
2  Tue 23.2  TRUE   TRUE       77
3  Wed 24.1 FALSE   TRUE       85
4  Thu 24.0  TRUE   TRUE       75
5  Fri 23.2  TRUE   TRUE       66
6  Sat 24.7  TRUE   TRUE       89

> str(weather)
'data.frame':    6 obs. of  5 variables:
 $ days    : chr  "Mon" "Tue" "Wed" "Thu" ...
 $ temp    : num  25.4 23.2 24.1 24 23.2 24.7
 $ rain    : logi  FALSE TRUE FALSE TRUE TRUE TRUE
 $ cloudy  : logi  FALSE TRUE TRUE TRUE TRUE TRUE
 $ humidity: num  83 77 85 75 66 89
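
If you got stuck, here is one possible solution, not the only one. It assumes the six-day weather data frame from before the Sunday detour and uses cbind(), the column-wise counterpart of rbind(). Assigning with $ (shown in the comment) should give the same result.

```r
weather <- data.frame(
  days   = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  temp   = c(25.4, 23.2, 24.1, 24.0, 23.2, 24.7),
  rain   = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
  cloudy = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# The new column must have one value per existing row (6 here)
humidity <- c(83, 77, 85, 75, 66, 89)

# Option 1: bind the vector on as a new, named column
weather <- cbind(weather, humidity = humidity)

# Option 2 (equivalent): assign to a new column name with $
# weather$humidity <- humidity

str(weather)
```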

Alright, this is where I'm stopping for today. Obviously, there will be more on data frames and how to operate, navigate and manipulate them but it will make this post way too long if I shoved them all into one single post.

Stay tuned for part 2 of R Basics - Frame, Frame, Data Frame. Just a heads up, I will be working outstation for the remainder of the year so the next post should be coming out at the start of next year. I'll do my best and see whether I can squeeze the time to finish writing part 2 before the end of the year. See you next time!