Colorado State University
Module 4: Data manipulation
The purpose of this module is to get you accustomed to using the popular packages available in the tidyverse collection. The tools provided in these packages should meet most of your typical day-to-day data manipulation needs. Data manipulation refers to the process of modifying and applying logic to data to make it more organized and meaningful.
1. What are R Packages?
If you are struggling to figure out how to accomplish a task using base R functions, chances are someone has created a package to do exactly what you need. The best part is, most packages can be installed and loaded directly through RStudio.
Some functions are built into R: mean(), lm(), table(), etc. They actually come from built-in packages
base, stats, graphics, etc. Anyone (yes, anyone) build their own package to add to the functionality of R
ggplot2, dplyr, data.table, etc.
​
When you use install.packages, packages are downloaded from CRAN (The Comprehensive R Archive Network). This is also where you downloaded R. Packages can be hosted lots of other places, such as Github (for personal projects or while still developing). The folks at CRAN check to make things "work" in some sense, but don't check on the statistical methods...But because R is open-source, you can always read the code yourself. Two functions from different packages can have the same name... if you load them both, you may have some trouble.
​
The packages that you will find most useful are conveniently bundled within the tidyverse package (The same people who make RStudio also are responsible for a set of packages called the tidyverse):
ggplot2 – generate nearly any quantitative plot you would find in a scientific journal
dplyr – shortcuts for subsetting, summarizing, rearranging, and joining datasets
tidyr – intuitive functions for changing the layout of your data
stringr – necessary functions for working with character data
lubridate – tools that make working with dates and times way easier
​
To utilize any of the packages listed above, type and enter the following into the command line.
install.packages("tidyverse")
Then, at the start of your next R script, type the following to load your new packages:
​library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
R will prioritize using the functions loaded in the beginning of the script. It is worth noting that you can also use library(tidyverse) to load several (but not all) of the tidyverse packages, but this is not recommended.
2. Manipulating the Titanic Data
Simplify Your Code with the Pipe Operator (%>%). You can interpret %>% as “and then”. In the context of working with data, it might look something like: remove the missing observations %>% create a new set of variables %>% remove unwanted variables %>% group the data by category %>% generate summary statistics. The %>% operator can either be typed out manually or entered with the keyboard shortcut Ctrl+Shift+M.
​
​To start, preview the data using the summary() function.
summary(titanic)
2.1Filtering
The filter() function is used to remove or keep observations from a dataset based on one or multiple logical conditions.
titanic %>% filter(Fare == 0)
To remove the observations with Fare equal to zero, we will use the != logical operator, which means not equal to. The new dataset will be called titanic_clean.
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) # filter(Fare > 0) would also work
2.2 Making new variables with mutate()
Within mutate(), you call a variety of operations and functions to create new and interesting data. Variables are created using the = sign, and multiple variables can be created by separating statements with a ,.
​
In the following lines of code, several new variables are created:
-
age_sq: Age squared
-
alone: Binary variable that takes a value of 1 if the passenger was travelling alone and 0 otherwise. To do this, the ifelse() statement and | operator are used to output a value of 1 if Siblings.Spouses.Aboard or Parents.Children.Aboard is greater than zero. The OR operator, |, indicates that only one of the logical conditions needs to be true.
-
alone2: Same variable as alone, but created using the AND, &, operator, which requires both logical conditions to be true. In this case both Siblings.Spouses.Aboard and Parents.Children.Aboard need to be equal to zero.
-
age_group: Categorical variable indicating whether the passenger was an infant, toddler, child, teen, adult, middle age adult, or senior adult. This variable is created using the case_when() function, which allows you to “vectorize” multiple ifelse() statements. It essentially allows you to assign different outputs based on different logical conditions without having to nest several ifelse() statements. In the case below, different character outputs are assigned to different age ranges. The variable is then modified using factor() to define an ordinal ranking of the variable from youngest to oldest age group.
-
income_class: Categorical variable indicating the income class of the passenger. For this variable, I assume that passenger class (1st, 2nd, 3rd) is a suitable proxy for upper, middle, and lower income status.
-
Sex: A simple modification of the existing Sex variable to make the first letter capital.
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
)
2.3 Tidying Data
select() is used to keep a set of variables you are interested in, rename() is used to rename variables to a consistent manner, and arrange() to order the data.
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
) %>% # and then...
# only keep variables of interest
select(Survived, Sex, Age, alone, age_group, income_class)
%>% # and then...
# rename variables in a consistent manner (snake_case)
rename(survived = Survived,
sex = Sex,
age = Age) %>% # and then...
# order the data by age
arrange(age)
head(titanic_clean)
2.4 Grouping and Summarizing Data
From the tidyr vignette: "It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it's not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected."
​
Now that 80% of the work is done, we can do some analysis. One approach is to look at survival rates by income class, age category, or other relevant groups. To do this, we will use the following two functions:
​
-
group_by() - define groups within the data such that all following operations are performed by group rather than by observation. (ungroup() removes grouping)
-
summarize() – performs operations by groups and creates a new data frame with one row per group and one column per summary statistic.
​
Survival rates by income class:
titanic_clean %>%
# group by income class
group_by(income_class) %>%
# calculate survival rate by group
summarize(survival_rate = mean(survived))
Survival rates by age group:
titanic_clean %>%
group_by(age_group) %>%
summarize(survival_rate = mean(survived))
Survival rates by solo travelers:
titanic_clean %>%
group_by(alone) %>%
summarize(survival_rate = mean(survived))
Given the variability of survival rates across different variables, it may be interesting to look at survival rates conditional on multiple variables. These next lines of code create a new data frame, titanic_grouped, with survival rates by age group, sex, and income class. It may also be the case that some groups have very few passengers that fit the description (e.g., upper class female toddlers). Since survival rates can be deceptive depending on how many passengers fall into a certain group, we will use the n() function create a count.
titanic_grouped <- titanic_clean %>%
group_by(age_group, sex, income_class) %>%
summarize(count = n(),
survival_rate = mean(survived)) %>%
ungroup() %>%
arrange(-survival_rate, age_group)
head(titanic_grouped, 20)
Notice that a minus sign, -, can be used to change the ordering with arrange() from ascending (low to high) to descending (high to low).