Module 3: Data in R

1. Types of Data

Before moving on to some exercises with actual data, it is necessary to understand how R treats and stores different types of data. If you are unsure what type of data you are working with, you can always use the typeof() function, which we will use in the following examples.

1.1 Doubles

Doubles are real numbers that may include a decimal part.

typeof(1.234)

[1] "double"

1.2 Integers

Integers are real whole numbers.

# R will default to storing numerical data as doubles
typeof(1)

# you can override this with the as.integer() function
typeof(as.integer(1))

# or you can use the : operator to create an integer sequence
typeof(1:4)

[1] "double"
[1] "integer"
[1] "integer"
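Another way to create an integer, not used in the examples above, is to add the L suffix to a whole-number literal. The following is a minimal sketch; the variable names x_double and x_integer are only illustrative.

```r
# the L suffix tells R to store a whole-number literal as an integer
x_double  <- 5
x_integer <- 5L

typeof(x_double)
typeof(x_integer)

# arithmetic that mixes an integer with a double returns a double
typeof(x_integer + 1.5)

# addition stays integer when every operand is an integer
typeof(x_integer + 1L)
```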

1.3 Characters

Character data stores text. You cannot apply mathematical operations to text.

typeof("DARE")

[1] "character"

diff_data_vector <- c(1:10, seq(0, 5, 0.5), "DARE")

typeof(diff_data_vector)

[1] "character"

If you try to apply a mathematical operation to diff_data_vector, you will receive an error because of "DARE".
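To see this behavior without stopping your script, you can wrap the operation in tryCatch(). The sketch below uses a shorter, hypothetical vector called mixed, and also shows as.numeric(), which converts the elements it can and returns NA for the rest.

```r
# combining numbers and text in c() coerces every element to character
mixed <- c(1:3, "DARE")
typeof(mixed)

# arithmetic on a character vector stops with an error;
# tryCatch() captures the error message so the script can continue
result <- tryCatch(mixed + 1, error = function(e) conditionMessage(e))
result

# as.numeric() converts what it can and returns NA (with a warning) for "DARE"
converted <- suppressWarnings(as.numeric(mixed))
converted
```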

1.4 Logicals

Logicals are data that can only take two values: TRUE and FALSE (or T and F). Check out https://www.statmethods.net/management/operators.html for a nice list of logical operators such as greater than, less than or equal to, not equal to, etc.

2 > 45

2 + 1 == 3
typeof(2 + 1 == 3)

[1] FALSE
[1] TRUE
[1] "logical"
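Logical operators also work element by element on vectors, and the results can be combined or counted. Here is a small sketch using a made-up vector of ages (the values and names are assumptions for illustration only).

```r
ages <- c(15, 22, 37, 64)   # hypothetical values

# comparison operators are applied to each element
adult <- ages >= 18
adult

# combine conditions with & (and), | (or), and ! (not)
working_age <- ages >= 18 & ages < 65
!working_age

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(adult)    # number of TRUE values
mean(adult)   # share of TRUE values
```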

1.5 Dates

You will inevitably work with dates in your research. In R, dates are internally stored as doubles, but with a date object class. Dates can be entered or converted using R's canned as.Date() function.

date <- as.Date("12/31/99", "%m/%d/%y")

date

[1] "1999-12-31"

typeof(date)
class(date)

[1] "double"
[1] "Date"

Since dates are stored as doubles, you can perform some mathematical operations, such as adding days.

date + 1
date + 31

[1] "2000-01-01"
[1] "2000-01-31"
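A few other date operations follow from the same idea. The sketch below is only an illustration; the names start, end, and gap are assumptions, and the second date is arbitrary.

```r
start <- as.Date("12/31/99", "%m/%d/%y")
end   <- as.Date("2000-03-01")   # text in YYYY-MM-DD form needs no format string

# subtracting two dates gives the difference in days
gap <- end - start
as.numeric(gap)

# the underlying double is the number of days since 1970-01-01
as.numeric(start)

# format() turns a date back into text in the layout you choose
format(start, "%d.%m.%Y")
```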

2. Data Structures

Now that you have a good understanding of the way R treats different types of data, the next step is understanding how to store different types of data. Atomic vectors, matrices, and arrays are used for storing homogeneous data. When you save an object of one of these data structures, all the data within will be saved as one data type. Data frames and lists are structures used for storing different types of data in one object.

2.1 Atomic Vectors

The atomic vector is the most fundamental data structure in R. Unlike a [1 x n] matrix or a one-dimensional array, an atomic vector has no dimension attribute, so it cannot be classified as a row or column vector.

dim(diff_data_vector)

NULL
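To make the contrast concrete, the following sketch compares a plain vector with the same values after a dim attribute has been assigned. The names v and m are illustrative.

```r
v <- 1:6

# an atomic vector has a length but no dim attribute
length(v)
dim(v)

# assigning dim turns the same values into a 2 x 3 matrix (filled by column)
m <- v
dim(m) <- c(2, 3)
dim(m)

is.matrix(v)
is.matrix(m)
```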

2.2 Matrices

A matrix is an extension of the atomic vector. Matrices are essentially atomic vectors with a specified number of rows and columns. Similar to atomic vectors, the elements of a matrix must be the same data type.

A <- matrix(c(10, 8,
              5, 12), ncol = 2, byrow = TRUE)
typeof(A)
dim(A)

[1] "double"
[1] 2 2
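Matrix elements are selected with [row, column] indexing. The sketch below rebuilds A so that it runs on its own, and adds a hypothetical matrix B to show what happens when byrow is left at its default.

```r
A <- matrix(c(10, 8,
              5, 12), ncol = 2, byrow = TRUE)

A[1, 2]    # element in row 1, column 2
A[, 1]     # entire first column, returned as an atomic vector
A[2, ]     # entire second row

# byrow = FALSE (the default) fills column by column instead
B <- matrix(c(10, 8, 5, 12), ncol = 2)
B

t(A)       # transpose of A
```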

2.3 Arrays

Arrays are objects that can store data in more than two dimensions. Matrices only have two dimensions: rows and columns. The easiest way to conceptualize an array is to picture a data cube. Think of a Rubik's Cube, where a number is stored within each of the little colored boxes.

multiarray <- array(1:27, dim = c(row_Size = 3,
                                  column_Size = 3,
                                  matrices = 3))
multiarray

## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
##
## , , 3
##
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27

Similar to atomic vectors and matrices, you can select certain elements (or vectors/matrices) by indexing. Recall that leaving an index slot blank returns everything from that dimension.

# select the element in the first row and first column of the first matrix
multiarray[1,1,1]
# select the element in the first row and second column of every matrix
multiarray[1,2, ]
# select all elements of the third matrix
multiarray[ , ,3]
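The shape of what you get back depends on how many index slots you leave blank. The sketch below rebuilds the array (without dimension names, for brevity) and adds apply(), which is not covered above, as one possible way to summarize each matrix.

```r
multiarray <- array(1:27, dim = c(3, 3, 3))

multiarray[1, 1, 1]        # a single value
multiarray[1, 2, ]         # a vector with one value from each matrix
dim(multiarray[, , 3])     # a 3 x 3 matrix

# apply() summarizes over a chosen dimension; MARGIN = 3 means "per matrix"
matrix_sums <- apply(multiarray, 3, sum)
matrix_sums
```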

2.4 Factors

Factors store categorical information. For example, you may need to use some data that is ordinal in nature. Take for example Likert scale data. You cannot quantify how much greater strongly agree is than somewhat agree, but the ordering has meaning. By converting something to a factor using the factor() function, R will store it as a vector of integers with a corresponding set of character values.

likert_levels <- c("strongly disagree", "strongly agree",
                   "disagree", "agree",
                   "somewhat disagree", "somewhat agree",
                   "neutral")

typeof(likert_levels)

## [1] "character"

likert_levels <- factor(likert_levels,
                        # Specify the ordering using levels
                        levels = c("strongly agree", "agree", "somewhat agree",
                                   "neutral", "somewhat disagree", "disagree",
                                   "strongly disagree"))

typeof(likert_levels)

## [1] "integer"

# use the attributes function to see the levels
attributes(likert_levels)

$levels
[1] "strongly agree"    "agree"             "somewhat agree"    "neutral"
[5] "somewhat disagree" "disagree"          "strongly disagree"

$class
[1] "factor"
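You can inspect the integer codes directly with as.integer(). If you also want R to allow comparisons such as "greater than" between levels, one option is the ordered = TRUE argument of factor(), which the example above does not use. The sketch below assumes a shorter five-point scale and made-up responses.

```r
responses <- c("agree", "neutral", "strongly agree", "agree")  # hypothetical survey answers

likert <- factor(responses,
                 levels = c("strongly disagree", "disagree", "neutral",
                            "agree", "strongly agree"),
                 ordered = TRUE)

as.integer(likert)     # the integer codes R stores
levels(likert)         # the labels those codes refer to
likert > "neutral"     # comparisons are allowed for ordered factors
table(likert)          # counts per level, including unused levels
```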

2.5 Data Frames

A data frame is composed of equal-length vectors, each of which can have its own data type, making it a rectangular, two-dimensional structure (rows and columns). In other words, a data frame resembles a matrix whose named columns can each hold a different type of data. You can create a data frame using the data.frame() function, or by importing data with read.csv() or read.table().

my_first_df <- data.frame(numbers = 1:4,
                          letters = c("a", "b", "c", "d"),
                          logicals = c(TRUE, FALSE, FALSE, TRUE),
                          dates = seq(as.Date("01/01/99", "%m/%d/%y"),
                                      as.Date("01/01/02", "%m/%d/%y"),
                                      "years"))

my_first_df

##   numbers letters logicals      dates
## 1       1       a     TRUE 1999-01-01
## 2       2       b    FALSE 2000-01-01
## 3       3       c    FALSE 2001-01-01
## 4       4       d     TRUE 2002-01-01

Like matrices, you can select certain elements of a data frame using brackets. However, since our columns now have names, you can select columns by their name.

# If you select only one column, R will return an atomic vector
my_first_df[,"numbers"]
# If you select multiple columns, R will return another data frame
my_first_df[,c("letters", "dates")]

You can also reference columns and create a new variable using the $ operator.

my_first_df$logicals
my_first_df$logicals[1]

my_first_df$new_date <- my_first_df$dates + 7
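A few base R functions help you check what a data frame contains. The following sketch uses a smaller, hypothetical data frame called df so that it runs on its own; it also shows that a logical column can be used to filter rows.

```r
df <- data.frame(numbers  = 1:4,
                 letters  = c("a", "b", "c", "d"),
                 logicals = c(TRUE, FALSE, FALSE, TRUE))

str(df)                 # compact overview of every column
sapply(df, class)       # class of each column

# keep only the rows where the logicals column is TRUE
kept <- df[df$logicals, ]
kept

# create a new column from an existing one
df$doubled <- df$numbers * 2
ncol(df)
```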

2.6 Lists

In R, lists act as storage bins. Not only can you include different data types, you can store different data structures as well. You can create lists using the list() function.

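# logic is not defined earlier in this module; this definition matches the values printed for [[2]] below
logic <- c(FALSE, TRUE, TRUE, TRUE)
my_first_list <- list("DARE", logic, as.Date("08/15/2024", "%m/%d/%Y"), A, multiarray, likert_levels, my_first_df)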

my_first_list

## [[1]]
## [1] "DARE"
##
## [[2]]
## [1] FALSE  TRUE  TRUE  TRUE
##
## [[3]]
## [1] "2024-08-15"
##
## [[4]]
##      [,1] [,2]
## [1,]   10    8
## [2,]    5   12
##
## [[5]]
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
##
## , , 3
##
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27
##
##
## [[6]]
## [1] strongly disagree strongly agree    disagree          agree
## [5] somewhat disagree somewhat agree    neutral
## 7 Levels: strongly agree agree somewhat agree neutral ... strongly disagree
##
## [[7]]
##   numbers letters logicals      dates   new_date
## 1       1       a     TRUE 1999-01-01 1999-01-08
## 2       2       b    FALSE 2000-01-01 2000-01-08
## 3       3       c    FALSE 2001-01-01 2001-01-08
## 4       4       d     TRUE 2002-01-01 2002-01-08

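To extract an element from a list, use double brackets [[ ]] rather than the single brackets used for the other structures. Single brackets also work on a list, but they return a smaller list rather than the element itself.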

my_first_list[[6]]
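The difference between the two kinds of brackets is easiest to see by checking the class of what comes back. The sketch below uses a small, hypothetical list called bin with named elements; naming list elements is optional and is not done in the example above.

```r
bin <- list(label = "DARE", flags = c(FALSE, TRUE), numbers = 1:3)  # hypothetical contents

class(bin[1])          # single brackets return a list of length 1
class(bin[[1]])        # double brackets return the element itself

bin$numbers            # named elements can also be reached with $
bin[["numbers"]][2]    # chain a second index to reach inside the element
```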

3. Importing Data

The data we will use for this exercise contains information on 887 passengers aboard the Titanic. The data can be downloaded here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html. When you download the Titanic data, you may notice the ".csv" extension. To import this data into R, we will use the read.csv() function. You can also import this data by clicking 'File > Import Dataset' from the menu bar, but this is not recommended.

You need to provide the read.csv() function with either a file path on your computer or a URL that links directly to the download (you can usually get the URL by right-clicking a download link on a webpage and copying the link address). If you copy and paste a file path from your file explorer, you will need to replace every \ with a /. The read.csv() function will import the data as a data frame.

# import using a file path
titanic <- read.csv("C:/Users/wming/Downloads/titanic.csv") # replace this path with your own
names(titanic)
head(titanic, n = 5) # the head() function prints the first n=5 rows

# import using a url

titanic <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
names(titanic)
head(titanic, n = 5)
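If you would rather not edit a copied Windows path by hand, one possible approach is to let R swap the slashes. The sketch below uses a hypothetical path, then reads a tiny made-up .csv written to a temporary file so that it runs without downloading anything; the column names are assumptions, not the Titanic columns.

```r
# a path as copied from the file explorer (hypothetical);
# inside an R string each backslash has to be typed twice
win_path <- "C:\\Users\\me\\Downloads\\titanic.csv"

# replace every backslash with a forward slash
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path

# write a tiny made-up .csv to a temporary file and import it
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Survived", "Ann,29,1", "Bob,40,0"), tmp)

small <- read.csv(tmp)
class(small)   # read.csv() returns a data frame
str(small)
```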

4. Exporting/Saving Data

If you want to export your data as a new file, you can use the write.csv() function. Make sure you save your altered data as a different filename from the raw data. NEVER save over your raw data, especially if you are collaborating. This is important for reproducibility.

# export our new data
write.csv(titanic, "C:/Users/wming/Downloads/titanic_V1.csv",
row.names = FALSE)
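A quick way to confirm that an export worked is to read the file back in and compare it with the original. This is a minimal sketch using a hypothetical data frame called scores and a temporary file instead of a real path.

```r
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))  # hypothetical data

out_file <- tempfile(fileext = ".csv")
write.csv(scores, out_file, row.names = FALSE)

# with row.names = FALSE the header line holds only the column names
header <- readLines(out_file)[1]
header

# read the file back in and compare it with the original
reloaded <- read.csv(out_file)
all.equal(scores, reloaded)
```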

If you are using very large datasets and only work in R, consider saving your data in one of R's own file formats. Use saveRDS() to write a single object to an .Rds file and readRDS() to read it back; use save() to write one or more objects to an .RData file and load() to restore them. These R-specific file types are compressed by default and typically take up less disk space than the equivalent .csv.
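The following sketch shows a round trip with R's built-in serialization functions: saveRDS() and readRDS() for a single object, and save() and load() for an .RData file. The data frame scores and the temporary file names are assumptions for illustration.

```r
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))  # hypothetical data

# an .rds file stores ONE object; you choose its name when reading it back
rds_file <- tempfile(fileext = ".rds")
saveRDS(scores, rds_file)
scores_copy <- readRDS(rds_file)
identical(scores, scores_copy)

# an .RData file stores one or more objects together with their names
rdata_file <- tempfile(fileext = ".RData")
save(scores, file = rdata_file)
rm(scores)
load(rdata_file)     # restores an object called scores
exists("scores")
```

Because load() restores objects under their original names, it can silently overwrite an object of the same name in your workspace; readRDS() avoids this by making you assign the result yourself.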