Big Data Analytics - Introduction to R

This section introduces the R programming language. R can be downloaded from the CRAN website. For Windows users, it is useful to also install Rtools and the RStudio IDE.

The general concept behind R is to serve as an interface to other software developed in compiled languages such as C, C++, and Fortran and to give the user an interactive tool to analyze data.

Navigate to the bda/part2/R_introduction folder of the book zip file and open the R_introduction.Rproj file. This will open an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful way to learn is to type the code yourself, which helps you get used to R syntax. In R, comments are written with the # symbol.

To display the results of running R code in the book, the results R returns are shown as comments after the code that produces them. This way, you can copy and paste sections of the code from the book and try them directly in R.

# Create a vector of numbers
numbers = c(1, 2, 3, 4, 5)
print(numbers)
# [1] 1 2 3 4 5

# Create a vector of letters
ltrs = c('a', 'b', 'c', 'd', 'e')
print(ltrs)
# [1] "a" "b" "c" "d" "e"

# Concatenate both
mixed_vec = c(numbers, ltrs)
print(mixed_vec)
# [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"


Let’s analyze what happened in the previous code. It is possible to create vectors of numbers and vectors of letters, and we did not need to tell R beforehand which data type we wanted. We were also able to create a vector with both numbers and letters. When mixed_vec was created, R coerced the numbers to character; we can see this because the values are printed inside quotes.
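The coercion described above follows a fixed order: when types are mixed in one vector, R converts every element to the most flexible type present (logical, then integer, then double, then character). The following is a minimal sketch of that behavior; the variable names are only illustrative and are not part of the book's scripts.

```r
# Each call to c() mixes two types; class() reports the common type
# that R coerced all the elements to.
lgl_int = c(TRUE, 2L)      # logical + integer
class(lgl_int)
# [1] "integer"

int_dbl = c(2L, 3.5)       # integer + double
class(int_dbl)
# [1] "numeric"

dbl_chr = c(3.5, "a")      # double + character
class(dbl_chr)
# [1] "character"

# Coercion in the other direction has to be requested explicitly
digits_chr = c("1", "2", "3")
digits_num = as.numeric(digits_chr)
digits_num
# [1] 1 2 3
```

Because the conversion is silent, it is worth checking class() after combining vectors from different sources.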

The following code shows the data type of different vectors as returned by the function class. It is common to use the class function to "interrogate" an object, asking it what its class is.

### Evaluate the data types using class

### One dimensional objects
# Integer vector
num = 1:10
class(num)
# [1] "integer"

# Numeric vector, it has a float, 10.5
num = c(1:10, 10.5)
class(num)
# [1] "numeric"

# Character vector
ltrs = letters[1:10]
class(ltrs)
# [1] "character"

# Factor vector
fac = as.factor(ltrs)
class(fac)
# [1] "factor"


R also supports two-dimensional objects. The following code shows examples of the two most popular data structures in R: the matrix and the data.frame.

# Matrix
M = matrix(1:12, ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
lM = matrix(letters[1:12], ncol = 4)
#     [,1] [,2] [,3] [,4]
# [1,] "a"  "d"  "g"  "j"
# [2,] "b"  "e"  "h"  "k"
# [3,] "c"  "f"  "i"  "l"

# Coerces the numbers to character
# cbind concatenates two matrices (or vectors) in one matrix
cbind(M, lM)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] "1"  "4"  "7"  "10" "a"  "d"  "g"  "j"
# [2,] "2"  "5"  "8"  "11" "b"  "e"  "h"  "k"
# [3,] "3"  "6"  "9"  "12" "c"  "f"  "i"  "l"

class(M)
# [1] "matrix" "array"
# (R versions before 4.0.0 print only "matrix")
class(lM)
# [1] "matrix" "array"

# data.frame
# One of the main objects of R, handles different data types in the same object.
# It is possible to have numeric, character and factor vectors in the same data.frame

df = data.frame(n = 1:5, l = letters[1:5])
df
#   n l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e


As demonstrated in the previous example, it is possible to use different data types in the same object. In general, this is how data is presented in databases and APIs: part of the data is text (character vectors) and the rest is numeric. It is the analyst's job to determine which statistical data type to assign to each variable and then use the correct R data type for it. In statistics we normally consider variables to be of the following types −

• Numeric
• Nominal or categorical
• Ordinal

In R, a vector can be of the following classes −

• Numeric - Integer
• Factor
• Ordered Factor

R provides a data type for each statistical type of variable. The ordered factor is rarely used, however; it can be created with the function factor (with ordered = TRUE) or with the function ordered.
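To make the difference between a nominal and an ordinal variable concrete, here is a small sketch that builds both from the same values. The sizes vector is made-up example data, not taken from the book.

```r
# Hypothetical example data
sizes = c("medium", "small", "large", "small")

# Nominal variable: levels are sorted alphabetically and carry no order
size_fac = factor(sizes)
levels(size_fac)
# [1] "large"  "medium" "small"

# Ordinal variable: the levels argument fixes the order explicitly
size_ord = factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
class(size_ord)
# [1] "ordered" "factor"

# Comparisons are meaningful only for ordered factors
size_ord < "large"
# [1]  TRUE  TRUE FALSE  TRUE

# ordered() is a shortcut for factor(..., ordered = TRUE)
size_ord2 = ordered(sizes, levels = c("small", "medium", "large"))
class(size_ord2)
# [1] "ordered" "factor"
```

Passing levels explicitly matters: without it, R would order the levels alphabetically, which rarely matches the intended ranking.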

The following code covers indexing, a very common operation that selects sections of an object so they can be inspected or transformed.

# Let's create a data.frame
# stringsAsFactors = TRUE is set explicitly: it was the default before R 4.0.0,
# but newer versions keep character columns as character unless asked otherwise
df = data.frame(numbers = 1:26, letters, stringsAsFactors = TRUE)
head(df)
#      numbers  letters
# 1       1       a
# 2       2       b
# 3       3       c
# 4       4       d
# 5       5       e
# 6       6       f

# str gives the structure of a data.frame; it is a good summary to inspect an object
str(df)
# 'data.frame': 26 obs. of  2 variables:
#  $ numbers: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...

# The output shows that the letters character vector was converted to a factor.
# This is the effect of the stringsAsFactors = TRUE argument of data.frame

class(df)
# [1] "data.frame"

### Indexing
# Get the first row
df[1, ]
#     numbers  letters
# 1       1       a

# Used for programming normally - returns the output as a list
df[1, , drop = TRUE]
# $numbers
# [1] 1
#
# $letters
# [1] a
# Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

# Get several rows of the data.frame
df[5:7, ]
#      numbers  letters
# 5       5       e
# 6       6       f
# 7       7       g

### Add one column that mixes the numeric column with the factor column
df$mixed = paste(df$numbers, df$letters, sep = '')
str(df)
# 'data.frame': 26 obs. of  3 variables:
#  $ numbers: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ mixed  : chr  "1a" "2b" "3c" "4d" ...

### Get columns
# Get the first column
df[, 1]
# It returns a one dimensional vector with that column

# Get two columns
df2 = df[, 1:2]
head(df2)

#      numbers  letters
# 1       1       a
# 2       2       b
# 3       3       c
# 4       4       d
# 5       5       e
# 6       6       f

# Get the first and third columns
df3 = df[, c(1, 3)]
df3[1:3, ]

#      numbers  mixed
# 1       1     1a
# 2       2     2b
# 3       3     3c

### Index columns from their names
names(df)
# [1] "numbers" "letters" "mixed"
# This is the best practice in programming, as indices often change, but
# variable names don't
# We create a variable with the names we want to subset
keep_vars = c("numbers", "mixed")
df4 = df[, keep_vars]
head(df4)

#      numbers  mixed
# 1       1     1a
# 2       2     2b
# 3       3     3c
# 4       4     4d
# 5       5     5e
# 6       6     6f

### subset rows and columns
# Keep the first five rows
df5 = df[1:5, keep_vars]
df5

#      numbers  mixed
# 1       1     1a
# 2       2     2b
# 3       3     3c
# 4       4     4d
# 5       5     5e

# subset rows using a logical condition
df6 = df[df$numbers < 10, keep_vars]
df6

#      numbers  mixed
# 1       1     1a
# 2       2     2b
# 3       3     3c
# 4       4     4d
# 5       5     5e
# 6       6     6f
# 7       7     7g
# 8       8     8h
# 9       9     9i
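
One detail from the column-indexing examples is worth a closer look: selecting a single column with df[, 1] returns a plain vector, not a data.frame. The sketch below rebuilds the same df and shows how the drop argument changes that; one_col_vec and one_col_df are illustrative names only.

```r
# Rebuild the example data.frame so the block runs on its own
df = data.frame(numbers = 1:26, letters, stringsAsFactors = TRUE)

# Selecting one column drops the result to a vector by default
one_col_vec = df[, 1]
class(one_col_vec)
# [1] "integer"

# drop = FALSE keeps the two-dimensional structure
one_col_df = df[, 1, drop = FALSE]
class(one_col_df)
# [1] "data.frame"
dim(one_col_df)
# [1] 26  1

# $ and [[ ]] select a single column by name and also return a vector
identical(df$numbers, df[["numbers"]])
# [1] TRUE
```

Using drop = FALSE is a common choice inside functions, where code written for data.frames would otherwise break when only one column happens to be selected.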