Mathematical Functions, Regular Expressions & Timestamps - R Basics

In this article, we go over a few minor subjects that will complete our basic learning of R.

Hey everyone! I'm happy to say that this will be the final part of learning the basics of R. Hooray!!! After this, you should have the fundamental knowledge you need for data manipulation and be ready to start using R. Of course, there is more to R than this, such as data visualization with the famous ggplot2 package and machine learning, both of which I will cover later. But before we get there, let's wrap up the basics.

Mathematical Functions

We already learnt some basic mathematical functions such as sum(), so here are a few extra functions which you may need along your data science journey.

  • abs() : computes the absolute value
  • mean() : computes the mean (average) of a vector of values
  • round() : rounds values to a given number of decimal places (0 by default)
  • ceiling() : rounds a value UP to the nearest integer
  • floor() : rounds a value DOWN to the nearest integer
  • trunc() : drops the decimal part, i.e. rounds towards 0 (meaning 4.99 will trunc() to 4 and -3.56 will trunc() to -3)
  • sqrt() : calculates the square root of a value
  • cos(), sin(), tan() : trigonometric functions (the input is in radians)
  • exp() : calculates e raised to the power of a value
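To see how the rounding functions differ, here is a small sketch you can paste into the console. The vector x is just made-up sample data for illustration.

```r
# Made-up sample values (an assumption for this illustration)
x <- c(-3.56, 4.99, 2.5)

abs(x)        # 3.56 4.99 2.50
mean(x)       # average of the three values
round(x, 1)   # -3.6  5.0  2.5 (one decimal place)
ceiling(x)    # -3  5  3 (always up)
floor(x)      # -4  4  2 (always down)
trunc(x)      # -3  4  2 (towards 0)
sqrt(16)      # 4
cos(0)        # 1 (input is in radians)
exp(1)        # e, roughly 2.718282
```

Notice that floor() and trunc() only disagree on negative numbers, which is an easy thing to trip over.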

Regular Expressions

Regular expressions are practically a language of their own, and if you extract data from strings all the time and can invest the time to master them, I believe it will be a godly skill. But for this tutorial, I will go through some basics so that you can get started.

RegExp functions

Let's go through some regular expression functions available in R. FYI, I won't be covering the regular expression language itself, as it has a lot of components that would make this post very, very long. Also, I don't think it's worth memorizing the regular expression language unless you work with it regularly. Usually I just refer to online cheat sheets like this one when I need it.

grep() & grepl()

When used on a vector, grep() returns a vector of the indexes of the elements where the pattern was found, while grepl() returns a vector of logical values (TRUE / FALSE), one per element, telling you whether the pattern was found in that element.

v <- c("abc", "def", "cba a", "aa")

grep("a", v)
[1] 1 3 4

grepl('a',v)
[1]  TRUE FALSE  TRUE  TRUE

Hopefully the example above shows the main difference between the two: grep() returns the indexes where the pattern matched, while grepl() returns a logical value for every element telling you whether it matched.
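In practice you will often want the matching elements themselves rather than their indexes. Here is a minimal sketch using the same vector v; value = TRUE is a standard argument of grep().

```r
v <- c("abc", "def", "cba a", "aa")

# value = TRUE returns the matching elements instead of their indexes
grep("a", v, value = TRUE)   # "abc"   "cba a" "aa"

# The logical vector from grepl() is handy for subsetting
v[grepl("a", v)]             # "abc"   "cba a" "aa"
v[!grepl("a", v)]            # "def"

# Summing a logical vector counts the TRUE values
sum(grepl("a", v))           # 3
```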

regexpr() & gregexpr()

The regexpr() function is slightly different. It returns a vector of integers with the same length as your vector, giving the character position in each string element at which the first match was found. If no match is found in an element, it returns -1 for that element. The gregexpr() function is essentially the same as regexpr(), except that it finds all matches in each string.

v <- c("abc", "def", "cba a", "aa")

regexpr('a', v)
[1]  1 -1  3  1
attr(,"match.length")
[1]  1 -1  1  1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

gregexpr('a', v)
[[1]]
[1] 1
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 3 5
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 1 2
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

I know there is a lot to go through, but we just need to focus on the important aspects of this. Let's look at regexpr() first. The first vector (1, -1, 3, 1) shows the character position where the first match was found. The second vector (1, -1, 1, 1), the match.length attribute, shows how many characters long each match is, with -1 meaning no match was found.

Now for gregexpr(), it generates a list with one entry per element of the vector. Looking at the 3rd entry in the list [[3]], it returned the character positions of ALL matches found, which in this case are positions 3 & 5.
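The positions on their own are not very readable, so base R also has regmatches(), which takes the original vector plus the result of regexpr() or gregexpr() and pulls out the matched text. A small sketch with the same vector v:

```r
v <- c("abc", "def", "cba a", "aa")

# First match per element; elements without a match are dropped
m <- regexpr("a", v)
regmatches(v, m)             # "a" "a" "a"

# All matches per element, returned as a list
# (the entry for "def" is an empty character vector)
g <- gregexpr("a", v)
regmatches(v, g)

# Number of matches found in each element
lengths(regmatches(v, g))    # 1 0 2 2
```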

sub() & gsub()

Lastly we have sub() and gsub(). These two functions are used for substituting a pattern within a string in R. The syntax for sub() and gsub() is:

sub(pattern,replacement,x)

gsub(pattern,replacement,x)

  • pattern : the pattern or string that you want to be replaced
  • replacement : the string to replace the pattern with
  • x : the character vector in which to replace the strings

The main difference between sub() and gsub(), just like regexpr() and gregexpr(), is that sub() only replaces the FIRST match in each element while gsub() replaces ALL the matches. Let's see a basic example to understand what I'm talking about here.

v <- c("abc", "def", "cba a", "aa")

# Replacing the first 'a' of each element with 'x' within vector v
sub('a', 'x', v)
[1] "xbc"   "def"   "cbx a" "xa"

# Replacing ALL 'a's with 'x' within vector v
gsub('a', 'x', v)
[1] "xbc"   "def"   "cbx x" "xx"
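Keep in mind that the pattern is treated as a regular expression, not as plain text, so characters like "." have a special meaning. The sketch below uses a made-up vector s to show a few common clean-up cases; fixed = TRUE is a standard argument that switches to literal matching.

```r
# Made-up sample strings (an assumption for this illustration)
s <- c("a.b.c", "  padded  ", "2022-01-19")

# "." matches any character in a regular expression, so either
# use fixed = TRUE or escape it to match a literal dot
gsub(".", "-", s[1], fixed = TRUE)   # "a-b-c"
gsub("\\.", "-", s[1])               # "a-b-c"

# Strip leading and trailing whitespace
gsub("^\\s+|\\s+$", "", s[2])        # "padded"

# Remove every character that is not a digit
gsub("[^0-9]", "", s[3])             # "20220119"
```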

And that wraps up the basic regular expression functions in R.

Dates & Timestamps

Finally... our last beginner-level topic on R basics. Phew~ Great job making it this far, just one small chapter to go and we will have reached a milestone in our learning journey. In projects, we often have to deal with dates and timestamps, especially when doing time-series analysis (which is crucial for me as I'm forecasting a budget for my company).

Dates

Starting off, we have dates. We can get today's date in R by using Sys.Date().

Sys.Date()
[1] "2022-01-19"

You can also convert character strings into the Date format (because R will not know on its own that your data is a date) by using as.Date(). For R, the default date format is YYYY-MM-DD, e.g. 2022-01-19. However, sometimes the data we receive is not in that format (I'm looking at you, America). To solve this, we just need to add a format argument to tell R how our date is written. These are the codes to use for describing the date format.

  • %d : Day of the month (number)
  • %m : Month (number)
  • %b : Month (abbreviated)
  • %B : Month (full name)
  • %y : Year (2 digits)
  • %Y : Year (4 digits)

Here are some examples you may face in the real world.

as.Date("Jan-19-21", format = "%b-%d-%y")
[1] "2021-01-19"

as.Date("19/01/2021", format = "%d/%m/%Y")
[1] "2021-01-19"

as.Date("19$January^2021", format = "%d$%B^%Y")
[1] "2021-01-19"

That last example was kind of weird, but it shows you the flexibility R has when it comes to formatting dates.
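Two things are worth knowing once you have a Date object. First, format() goes the other way and turns a date back into a string using the same codes. Second, as.Date() returns NA rather than an error when the string does not fit the format you gave. A minimal sketch (I stick to numeric codes here because the month names for %b and %B depend on your locale):

```r
d <- as.Date("19/01/2021", format = "%d/%m/%Y")

class(d)                  # "Date"

# format() turns a Date back into a string
format(d, "%d/%m/%Y")     # "19/01/2021"
format(d, "%Y")           # "2021"

# A string that does not fit the format gives NA, not an error
as.Date("2021-01-19", format = "%d/%m/%Y")   # NA

# Dates can be added to and subtracted, counted in days
d + 30                        # "2021-02-18"
as.Date("2021-12-31") - d     # Time difference of 346 days
```

It is a good habit to check for NA values after converting a whole column, since one wrong format can quietly wipe out your dates.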

Time

Similar to dates, R can also convert strings into time information, which it stores in a POSIXct object. I'm not familiar with the background or details of POSIXct, but there's no need to understand them to do data science. If you are interested, feel free to look it up. To convert a character string into POSIXct, we use as.POSIXct(), and its format argument works the same way as in as.Date(). These are the codes to use for describing the time format.

  • %H : Hours as number (00-23)
  • %I : Hours as number (01-12)
  • %M : Minute as number (00-59)
  • %S : Second as number (00-61)
  • %p : AM/PM (only used with %I, not %H)

Here are some examples:

as.POSIXct("09:57:23", format = "%H:%M:%S")
[1] "2022-01-19 09:57:23 +08"

as.POSIXct("Nov-19-2021 9:23:11PM", format = "%b-%d-%Y %I:%M:%S%p")
[1] "2021-11-19 21:23:11 +08"

As you can see, when your character string only has the time and you do not specify a date, R defaults to today's date. The same goes for the time zone: if you do not specify one, R uses your system's time zone.
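If you do not want the result to depend on the machine running the code, you can pass the tz argument to as.POSIXct(). Here is a rough sketch; the variable name stamp and the choice of "UTC" and "Asia/Singapore" are just assumptions for illustration, so swap in whichever time zone your data uses.

```r
# tz sets the time zone explicitly instead of using the system default
stamp <- as.POSIXct("2021-11-19 21:23:11",
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# format() works on times just like it does on dates
format(stamp, "%H:%M")    # "21:23"

# Adding a number to a POSIXct value adds that many seconds
format(stamp + 3600, "%H:%M")    # "22:23"

# The same moment displayed in another time zone (UTC+8)
format(stamp, "%Y-%m-%d %H:%M:%S", tz = "Asia/Singapore")
```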

Congratulations!!! We have just completed the basics of R and have closed out a chapter on our learning journey. After all this, you should be able to do basic data cleaning and manipulation using R, which is the first step in any project (provided we already have our data set ready). After this, the next step would be to proceed onto the 2nd part of learning data science, which is data visualisation.

Whoah whoah whoahhhhh, before we go there, I want to introduce you to the tidyverse. The tidyverse is a collection of packages designed to make your life easier when it comes to data cleaning and data manipulation. If you thought you were good using the functions in base R, you will be 10x more powerful once you have learnt dplyr and tidyr, which are the main highlights. See you guys next time!