Introduction

This document was written to address the needs of new data analysts who are interested in turning data into meaningful knowledge but struggle to find a set of tools for the task. Here you will explore how to work with large and complex data in R.

While visualizing and modeling are the ultimate goals for most data analysts, wrangling messy data is the unsung hero of this process. As any seasoned analyst knows, or will soon come to know, complex data sets are rarely ready to use “as is”: they may contain missing data, free-form text strings, inconsistently coded information, and so on. Simply put, bad input data yields poor outputs. This document addresses many of these scenarios. To do so, we will leverage the tidyverse and the most commonly used packages it includes.

In an effort to cover as much ground as possible while retaining a sense of cohesion, the tutorial is broken down into five sections, each covering a distinct theme:

  1. Introduction to the tidyverse
  2. Ingesting data with readr
  3. Data wrangling in dplyr
  4. Working with text with stringr
  5. Tidying data with tidyr

Along the way, you will learn how to ingest data into R and will be exposed to a number of terms and concepts required for your understanding of data manipulation in R. Table 1 provides a summary of the content in this lesson.

Table 1: Summary of Chapter Packages and Functions

| Package | Function(s) | Short Description |
|---------|-------------|-------------------|
| tidyverse | | tidyverse is a collection of R packages designed for data science. Each package used in this document (readr, dplyr, stringr, and tidyr) is part of this collection. |
| readr | read_csv() | readr is a package designed to read rectangular data, such as csv, tsv, etc. |
| dplyr | arrange(), filter(), group_by(), mutate(), rename(), select(), & summarise() | dplyr is a tidyverse package for data manipulation. It provides a consistent set of verbs that enable the analyst to solve the most common data manipulation challenges, such as subsetting, transforming, and summarizing data. The functions used in this chapter are among the most useful in the package. |
| stringr | str_length(), str_c(), str_sub(), str_to_lower(), str_to_upper(), str_to_title(), str_view(), str_view_all(), str_detect(), str_count(), str_extract(), & str_extract_all() | stringr is a package for working with strings, built on top of stringi. In this document we cover three families of stringr functions; namely, character manipulation, handling whitespace, and pattern matching. |
| tidyr | unnest() | tidyr is a package designed for transforming data sets into tidy data, where each variable is a column, each observation is a row, and each value is a cell. |

Conventions Used in This Tutorial

This tutorial follows the typographical conventions used in R for Data Science (Wickham and Grolemund 2017):

Italic - Indicating new terms, URLs, email addresses, file names, and file extensions.

Bold - Indicating the names of R packages.

Constant width - Used for program listings, as well as within paragraphs to refer to elements such as variable or function names, databases, data types, etc. In other words, it denotes code listing that should be typed as is or previously defined objects.

Constant width italic - Text that should be replaced with user-supplied values or determined by context.

Constant width bold - Shows commands or other text that users should type literally.

A Gentle (Brief and Broad) Introduction to the tidyverse

The focus of this document is to highlight tools from the so-called tidyverse, which is a collection of R packages designed for data science. Packages in this collection share a grammar and design, which means that they are meant to work together seamlessly (Wickham and Grolemund 2017). In order to install the complete tidyverse, use the install.packages() function to retrieve the package from CRAN. Make sure you are connected to the internet for the installation to succeed.

install.packages("tidyverse")

Once the installation is complete (sit tight, as this may take a couple of minutes), you should be able to load the package with the library() function:

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0.9000     ✔ purrr   0.3.2     
## ✔ tibble  2.1.3          ✔ dplyr   0.8.3     
## ✔ tidyr   0.8.3          ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

After loading, you should see a message telling you which tidyverse packages were loaded under Attaching packages (e.g., ggplot2, tibble, tidyr, etc.). Additionally, you should see a Conflicts section; don’t worry about this for now. It tells you that some tidyverse functions are masking functions with the same names from other packages (here, stats::filter() and stats::lag()).
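If you ever need one of the masked functions, you can always call it explicitly with the :: namespace operator. A quick illustration using built-in data sets:

stats::filter(presidents, rep(1, 3) / 3)  # the time-series filter from base R's stats package
dplyr::filter(mtcars, cyl == 6)           # the row-subsetting verb from dplyr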

Now that we have loaded the tidyverse let’s jump in and begin exploring some of the packages in this collection!

Loading Data with readr

The first step in a typical data science project consists of importing data. This typically means taking data stored locally or remotely and loading it into R. The simplest source for data import into R is a plain text file. Thus, here we will work exclusively with comma-separated values files (.csv extension), which are delimited text files that use commas to separate values. Keep in mind that R reads an entire data set into RAM all at once, and objects live entirely in memory. As such, you should consider the size of a data set and of your machine’s memory before beginning the ingestion process.
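If you are unsure whether a data set will strain your machine, base R’s object.size() reports how much memory an object occupies once loaded. A quick check on a built-in data set:

print(object.size(mtcars), units = "Kb")  # memory footprint of an object in your environment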

Before we get started with data ingestion in the tidyverse, let’s get acquainted with the data. Here we will work with data from Twitter, an online news and micro-blogging social networking platform. Specifically, we use the data behind the FiveThirtyEight story Why We’re Sharing 3 Million Russian Troll Tweets, made available to the public via GitHub. The data set contains nearly 3 million tweets sent from handles linked to the Internet Research Agency (IRA), the Russian organization tied to interference in the 2016 U.S. election.

The data was made available to FiveThirtyEight by Clemson University researchers, who gathered this information using a tool called Social Studio¹ and a list of IRA-connected handles included in the November 2017 and June 2018 reports provided to Congress by special counsel Robert Mueller. The resulting data set contains every tweet from the 2,752 handles of interest in the 2017 report between May 10, 2015 and November 2017. Furthermore, the 946 handles from the 2018 report are included beginning on June 19, 2015.

The GitHub repository containing these data lists multiple csv files. For instance, the first csv can be found in raw (text) format at the following URL:

https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv

Using this URL, let’s read these data into our environment using the read_csv() function from the readr package, which is included as part of the tidyverse, like so:

df <- read_csv(file = "https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv")
## Warning: 2803 parsing failures.
##  row        col           expected                   actual                                                                                                   file
## 1149 tco3_step1 1/0/T/F/TRUE/FALSE http://ow.ly/KH2T30a9YGX 'https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv'
## 1397 tco3_step1 1/0/T/F/TRUE/FALSE https://goo.gl/hV3VlX    'https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv'
## 2153 tco3_step1 1/0/T/F/TRUE/FALSE http://ow.ly/NRHJ30aPVFM 'https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv'
## 2164 tco3_step1 1/0/T/F/TRUE/FALSE https://goo.gl/CzsrEU    'https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv'
## 4319 tco3_step1 1/0/T/F/TRUE/FALSE http://bit.ly/sonar_2016 'https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv'
## .... .......... .................. ........................ ......................................................................................................
## See problems(...) for more details.
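The warning above is not fatal: read_csv() guessed from the first rows that tco3_step1 was a logical column and then encountered URLs. As the message suggests, you can inspect the failures with problems(); alternatively, here is a minimal sketch of declaring the column type up front so the guess is never made:

# Inspect the rows readr could not parse as guessed
problems(df)

# Re-read the file, declaring tco3_step1 as character to avoid the bad guess
df <- read_csv(file = "https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv",
               col_types = cols(tco3_step1 = col_character()))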

The resulting object, a data frame named df, contains 243,891 observations (rows) and 21 variables (columns). To get a sense of the data contained within this object, use the head(), str(), and names() functions from base R.

head(df)
str(df)
names(df)
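The tidyverse offers a compact alternative as well: glimpse() prints every column name, its type, and the first few values on one line each.

glimpse(df)  # transposed preview of all 21 columns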

Additionally, in order to familiarize yourself with the data, take a look at the following code book from FiveThirtyEight’s GitHub, which defines each variable in the data:

Table 2: Data set variable code book

| Variable | Definition |
|----------|------------|
| external_author_id | An author account ID from Twitter |
| author | The handle sending the tweet |
| content | The text of the tweet |
| region | A region classification, as determined by Social Studio |
| language | The language of the tweet |
| publish_date | The date and time the tweet was sent |
| harvested_date | The date and time the tweet was collected by Social Studio |
| following | The number of accounts the handle was following at the time of the tweet |
| followers | The number of followers the handle had at the time of the tweet |
| updates | The number of “update actions” on the account that authored the tweet, including tweets, re-tweets, and likes |
| post_type | Indicates if the tweet was a re-tweet or a quote-tweet |
| account_type | Specific account theme, as coded by Linvill and Warren |
| retweet | A binary indicator of whether or not the tweet is a re-tweet |
| account_category | General account theme, as coded by Linvill and Warren |
| new_june_2018 | A binary indicator of whether the handle was newly listed in June 2018 |
| alt_external_id | Reconstruction of the author account ID from Twitter, derived from the article_url variable and the first list provided to Congress |
| tweet_id | Unique ID assigned by Twitter to each status update, derived from article_url |
| article_url | Link to the original tweet; now redirects to an “Account Suspended” page |
| tco1_step1 | First redirect for the first http(s)://t.co/ link in a tweet, if it exists |
| tco2_step1 | First redirect for the second http(s)://t.co/ link in a tweet, if it exists |
| tco3_step1 | First redirect for the third http(s)://t.co/ link in a tweet, if it exists |

Keep in mind that while we have ingested data from a website, you can just as easily use the read_csv() function to read local files, like so:

df <- read_csv(file = "path/to/your/local/file")

As you may be able to tell by the dimensions and complexity of the Twitter data set, we will need many tools to work with these data. Luckily, R is just the platform for the task!

OPTIONAL: At this point, you may choose to proceed with this tutorial using the df object created from a single csv file. However, if your machine is able to handle larger data, you may want to run the following code to import all thirteen csv files. Note that this step is optional and may not be appropriate for all machines. If you choose to proceed with this bulk importation, your resulting data frame should be just short of 3 million observations, and the process might take a couple of minutes. Note that this document was written using the df object created from a single csv file.

n <- 1:13          # the repository splits the tweets across 13 csv files
url <- "https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_"
extension <- ".csv"

files <- paste0(url, n, extension)  # build the 13 file URLs

df <- map_dfr(files, read_csv)  # purrr::map_dfr() reads each file and row-binds the results

Data wrangling in dplyr

Now that the importation process is complete, we will explore dplyr, which was designed for data manipulation. As such, it contains a set of functions (here referred to as “verbs”) that enable users to perform common data manipulation operations. If you are familiar with base R, you will notice that the functions within dplyr mimic the functionality of various base commands; however, the underlying philosophy and syntax in dplyr make data manipulation easier.

Before we dive too deep into the package, take a look at the following table which highlights noteworthy dplyr functions:

Table 3: Summary of dplyr functions

| Function | Definition |
|----------|------------|
| %>% | The pipe operator, originally from the magrittr package; it allows you to chain functions together, reducing the number of nested function calls and making it easy to add operations step by step |
| arrange() | Sort rows by one or more variables |
| filter() | Select and keep rows based on a set of conditions |
| group_by() | Group observations by one or more variables, enabling you to perform operations on groups |
| mutate() | Add new variables to a data set, while preserving existing variables |
| rename() | Change column names |
| select() | Keep only the columns (variables) you mention in the function call |
| summarise() | Create a summary of a vector of values; most useful when paired with group_by() |

Don’t worry, you do not need to memorize all of these functions; you will become more comfortable with them through experience. However, it is important that you know how to get help as needed. Remember that you can look up a function by pairing its name with ? in the console, like so: ?select. Furthermore, you can find a cheat sheet here.

dplyr: What if I have too many extraneous variables?

This is not an uncommon question, as many data sets contain variables that may not be relevant to a specific project. Let’s begin by selecting a set of relevant variables and renaming them. In order to reduce the number of variables from our larger df data frame, let’s zoom in on the following variables:

  • external_author_id: The unique numeric identifier for each Twitter account
  • author: The user account handle
  • content: The body of each tweet or re-tweet
  • language: The language of the tweet
  • publish_date: The date and time the tweet was sent
  • post_type: Indicates if the tweet was a re-tweet or a quote-tweet
  • account_category: General account theme, as coded by Linvill and Warren
  • tweet_id: Unique id assigned by twitter to each status update, derived from article_url
df %>%
  select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id, handle = author, tweet = content)
## # A tibble: 243,891 x 8
##        uid handle tweet language publish_date post_type account_category
##      <dbl> <chr>  <chr> <chr>    <chr>        <chr>     <chr>           
##  1 9.06e17 10_GOP "\"W… English  10/1/2017 1… <NA>      RightTroll      
##  2 9.06e17 10_GOP Mars… English  10/1/2017 2… <NA>      RightTroll      
##  3 9.06e17 10_GOP Daug… English  10/1/2017 2… RETWEET   RightTroll      
##  4 9.06e17 10_GOP JUST… English  10/1/2017 2… <NA>      RightTroll      
##  5 9.06e17 10_GOP 19,0… English  10/1/2017 2… RETWEET   RightTroll      
##  6 9.06e17 10_GOP "Dan… English  10/1/2017 2… <NA>      RightTroll      
##  7 9.06e17 10_GOP 🐝🐝🐝 … English  10/1/2017 2… RETWEET   RightTroll      
##  8 9.06e17 10_GOP '@Se… English  10/1/2017 2… <NA>      RightTroll      
##  9 9.06e17 10_GOP As m… English  10/1/2017 3… <NA>      RightTroll      
## 10 9.06e17 10_GOP Afte… English  10/1/2017 3… <NA>      RightTroll      
## # … with 243,881 more rows, and 1 more variable: tweet_id <dbl>

Let’s take stock of the function call above. The df object is the first item called, followed by two functions piped with %>%. Note that the order of these operations is top-down, meaning that the select() function is applied before the rename() function. Because the data was the first item called, we do not need to keep passing it to each function. Finally, note that you do not need to rename all variables in your data set; you may choose to rename only a small subset, as seen above.
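To appreciate what the pipe buys you, note that the chain above is equivalent to nesting one call inside the other, which quickly becomes hard to read:

# The same operation without pipes: rename() wrapped around select()
rename(select(df, external_author_id, author, content, language,
              publish_date, post_type, account_category, tweet_id),
       uid = external_author_id, handle = author, tweet = content)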

dplyr: How do I filter and arrange my observations?

Say you want to filter out data; how would you proceed with this task? While select() and rename() operate on variables, filter() and arrange() are primarily intended for rows, or observations. Let’s continue expanding on the previous function call by filtering observations:

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type == "RETWEET" & language == "English")
## # A tibble: 102,029 x 8
##        uid handle tweet language publish_date post_type account_category
##      <dbl> <chr>  <chr> <chr>    <chr>        <chr>     <chr>           
##  1 9.06e17 10_GOP Daug… English  10/1/2017 2… RETWEET   RightTroll      
##  2 9.06e17 10_GOP 19,0… English  10/1/2017 2… RETWEET   RightTroll      
##  3 9.06e17 10_GOP 🐝🐝🐝 … English  10/1/2017 2… RETWEET   RightTroll      
##  4 9.06e17 10_GOP BREA… English  10/11/2017 … RETWEET   RightTroll      
##  5 9.06e17 10_GOP Beca… English  10/11/2017 … RETWEET   RightTroll      
##  6 9.06e17 10_GOP I am… English  10/11/2017 … RETWEET   RightTroll      
##  7 9.06e17 10_GOP Do y… English  10/12/2017 … RETWEET   RightTroll      
##  8 9.06e17 10_GOP Netw… English  10/12/2017 … RETWEET   RightTroll      
##  9 9.06e17 10_GOP "To … English  10/12/2017 … RETWEET   RightTroll      
## 10 9.06e17 10_GOP "So … English  10/12/2017 … RETWEET   RightTroll      
## # … with 102,019 more rows, and 1 more variable: tweet_id <dbl>

Once again, we begin by calling the data and piping functions. Notice that the filter() function takes multiple conditions combined with & or |. Here we have filtered for tweets that are not original content (i.e., re-tweets) and whose text is in English. Now let’s proceed by sorting this subset by posting date:

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type == "RETWEET" & language == "English") %>%
  arrange(publish_date)
## # A tibble: 102,029 x 8
##       uid handle tweet language publish_date post_type account_category
##     <dbl> <chr>  <chr> <chr>    <chr>        <chr>     <chr>           
##  1 1.65e9 ALECM… "Pus… English  1/1/2016 18… RETWEET   LeftTroll       
##  2 1.65e9 ANTON… None… English  1/1/2016 18… RETWEET   LeftTroll       
##  3 1.65e9 ANTON… Jame… English  1/1/2016 18… RETWEET   LeftTroll       
##  4 1.66e9 ADRGR… Gosp… English  1/1/2016 18… RETWEET   LeftTroll       
##  5 1.66e9 ADRGR… If K… English  1/1/2016 18… RETWEET   LeftTroll       
##  6 1.66e9 ADRGR… My b… English  1/1/2016 18… RETWEET   LeftTroll       
##  7 1.68e9 AMELI… "Joi… English  1/1/2017 0:… RETWEET   RightTroll      
##  8 1.68e9 AMELI… '@PM… English  1/1/2017 0:… RETWEET   RightTroll      
##  9 1.68e9 AMELI… #POS… English  1/1/2017 0:… RETWEET   RightTroll      
## 10 1.68e9 AMELI… #Bes… English  1/1/2017 0:… RETWEET   RightTroll      
## # … with 102,019 more rows, and 1 more variable: tweet_id <dbl>

The returned data is sorted from earliest tweet to newest. Alternatively, we could change the direction by using desc() to sort variables in descending order:

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type == "RETWEET" & language == "English") %>%
  arrange(desc(publish_date))
## # A tibble: 102,029 x 8
##        uid handle tweet language publish_date post_type account_category
##      <dbl> <chr>  <chr> <chr>    <chr>        <chr>     <chr>           
##  1 9.06e17 10_GOP The … English  9/9/2017 22… RETWEET   RightTroll      
##  2 9.06e17 10_GOP Ther… English  9/9/2017 19… RETWEET   RightTroll      
##  3 9.06e17 10_GOP Unit… English  9/9/2017 19… RETWEET   RightTroll      
##  4 9.06e17 10_GOP Than… English  9/9/2017 19… RETWEET   RightTroll      
##  5 9.06e17 10_GOP Chur… English  9/9/2017 1:… RETWEET   RightTroll      
##  6 9.06e17 10_GOP #IfI… English  9/9/2017 1:… RETWEET   RightTroll      
##  7 1.69e 9 ARCHI… "Dr.… English  9/9/2016 9:… RETWEET   RightTroll      
##  8 1.69e 9 ARCHI… #DEL… English  9/9/2016 9:… RETWEET   RightTroll      
##  9 1.69e 9 ARCHI… PATH… English  9/9/2016 9:… RETWEET   RightTroll      
## 10 1.69e 9 ARCHI… Even… English  9/9/2016 9:… RETWEET   RightTroll      
## # … with 102,019 more rows, and 1 more variable: tweet_id <dbl>

dplyr: What if I need to add new variables to my data frame?

In addition to selecting existing columns and manipulating rows, you may want to add new variables to a data set. This is where mutate() comes in handy. The function allows you to add new columns to the data set while keeping all existing ones, as opposed to transmute(), which also creates new columns but discards all other variables.
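A minimal sketch of the difference between the two, using the built-in mtcars data set:

# mutate() keeps every existing column and appends the new one
mtcars %>% mutate(kpl = mpg * 0.425)    # kpl: a hypothetical kilometers-per-liter column

# transmute() returns only the newly created column
mtcars %>% transmute(kpl = mpg * 0.425)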

Say you want to create a new column indicating whether or not a tweet was sent by an account tasked with creating political content, namely those accounts tagged “LeftTroll” or “RightTroll”. To do so, use mutate() together with the if_else() function, which returns one value when an observation passes a logical test and another when it does not.

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type == "RETWEET" & language == "English") %>%
  arrange(
    desc(publish_date)
    ) %>%
  mutate(political = if_else(account_category == "LeftTroll" | account_category == "RightTroll",
                             "Political", "Not Political")
         )
## # A tibble: 102,029 x 9
##        uid handle tweet language publish_date post_type account_category
##      <dbl> <chr>  <chr> <chr>    <chr>        <chr>     <chr>           
##  1 9.06e17 10_GOP The … English  9/9/2017 22… RETWEET   RightTroll      
##  2 9.06e17 10_GOP Ther… English  9/9/2017 19… RETWEET   RightTroll      
##  3 9.06e17 10_GOP Unit… English  9/9/2017 19… RETWEET   RightTroll      
##  4 9.06e17 10_GOP Than… English  9/9/2017 19… RETWEET   RightTroll      
##  5 9.06e17 10_GOP Chur… English  9/9/2017 1:… RETWEET   RightTroll      
##  6 9.06e17 10_GOP #IfI… English  9/9/2017 1:… RETWEET   RightTroll      
##  7 1.69e 9 ARCHI… "Dr.… English  9/9/2016 9:… RETWEET   RightTroll      
##  8 1.69e 9 ARCHI… #DEL… English  9/9/2016 9:… RETWEET   RightTroll      
##  9 1.69e 9 ARCHI… PATH… English  9/9/2016 9:… RETWEET   RightTroll      
## 10 1.69e 9 ARCHI… Even… English  9/9/2016 9:… RETWEET   RightTroll      
## # … with 102,019 more rows, and 2 more variables: tweet_id <dbl>,
## #   political <chr>

dplyr: How can I get a summary of my data?

Now that we have selected relevant columns, filtered out extraneous information, and added the information required for our analysis, let’s run a summary of the volume of re-tweets by political category. To do so, let’s bin observations into groups using the group_by() function, then run a summary using summarise().

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type == "RETWEET" & language == "English") %>%
  arrange(
    desc(publish_date)
    ) %>%
  mutate(political = if_else(account_category == "LeftTroll" | account_category == "RightTroll",
                             "Political", "Not Political")
         ) %>%
  group_by(political) %>%
  summarise(political_volume=n())
## # A tibble: 2 x 2
##   political     political_volume
##   <chr>                    <int>
## 1 Not Political            19969
## 2 Political                82060

Once again, let’s take stock of the function calls above. We began with the data set, which we then filtered for re-tweets in English, arranged from newest to oldest, augmented with a new political column, grouped by political category, and counted by volume within each category. Grouping enables you to run descriptive operations - such as counting observations, calculating means, etc. - on each group of observations. Here, we used the n() function to count the number of observations in each political group and assigned the result to the variable political_volume.
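The same grouping machinery extends to any descriptive statistic. For instance, here is a sketch (going back to the original df, which still contains the followers column from the code book) of the average follower count per account category:

df %>%
  group_by(account_category) %>%
  summarise(accounts = n_distinct(author),                   # unique handles per category
            mean_followers = mean(followers, na.rm = TRUE))  # average follower count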

dplyr: So What?

While this workflow may seem laborious or unintuitive at first, keep in mind the flexibility it provides. For instance, consider how few parameters you would have to change in order to obtain the same summary for original content tweets (i.e., non-re-tweets) that are not in English. Like so:

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type != "RETWEET" & language != "English") %>%
  arrange(
    desc(publish_date)
    ) %>%
  mutate(political = if_else(account_category == "LeftTroll" | account_category == "RightTroll",
                             "Political", "Not Political")
         ) %>%
  group_by(political) %>%
  summarise(political_volume=n())
## # A tibble: 2 x 2
##   political     political_volume
##   <chr>                    <int>
## 1 Not Political              323
## 2 Political                   55

Note the level of flexibility gained from piping functions. This is to say, you can make minimal modifications to each function rather than to the object. Additionally, rearranging the functions will produce different results as needed.

Working with Text in stringr

As a new or seasoned analyst, you are bound to discover sooner rather than later that working with text data is a necessary evil. While most data analysis methodologies are intended for continuous data, the rapid growth in computational power has enabled analysts to begin looking at other, messier, and richer data sources, among them documents, images, videos, etc. Here we will focus on working with text precisely because it is a necessary evil, but also because it unlocks a new realm of information for you to explore.

While stringr is not the only library for dealing with text, or indeed for analyzing it (e.g., word frequencies, topic modeling, natural language processing, etc.), it is an integral part of the tidyverse and includes many useful functions for dealing with text. Before we move forward, let’s define strings, the type of text data stringr is built for. Strings are text enclosed in either single or double quotes. While there is no real difference between the two, stick to using double quotes. Multiple strings are usually stored in a character vector, much like the content variable in our df data set. Let’s retrieve the first string in this vector using the $ and [ accessors:

df$content[1]
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"

Note that strings may contain numbers, special characters, spaces, line breaks, and other features in addition to words. All of these characters are valid as long as they are surrounded by single or double quotes. Because of this complexity, we will cover the following set of functions for dealing with strings:

Table 4: Summary of stringr functions

| Function | Definition |
|----------|------------|
| str_length() | Measure the length of a string |
| str_c() | Combine strings |
| str_sub() | Subset strings |
| str_to_lower() | Transform all words in a string to lower case |
| str_to_upper() | Transform all words in a string to upper case |
| str_to_title() | Transform all words in a string to title case |
| str_view() | Show the first regular expression match in a string |
| str_view_all() | Show all regular expression matches in a string |
| str_detect() | Detect the presence or absence of a pattern in a string |
| str_count() | Count the number of pattern matches in a string |
| str_extract() | Extract the first matching pattern from a string |
| str_extract_all() | Extract all matching patterns from a string |

Once again, you should not aim to memorize each and every function in the stringr library. Remember that you can access help within R by pairing ? with the function name in the console, or by checking out stringr’s site or cheat sheet.

stringr: What if the goal is determining the length of a string?

One of the key features of stringr is its consistency in naming functions: each begins with the str_ prefix, followed by a verb describing what it does. To measure the length of strings, we will use str_length(), like so:

str_length(df$content[1])
## [1] 156

The output from this function call tells us that the first tweet in our df data set is 156 characters long. Note that this count includes the spaces between words. If you need a count without them, you can enlist the gsub() function from base R, which finds a pattern and substitutes it (more on this later, when we cover regular expressions):

str_length(gsub(pattern = " ",
                replacement = "",
                df$content[1]))
## [1] 133

Now let’s tie this into the dplyr workflow previously used, like so:

df %>%
    select(external_author_id,
         author,
         content,
         language,
         publish_date,
         post_type,
         account_category,
         tweet_id) %>%
  rename(uid = external_author_id,
         handle = author,
         tweet = content) %>%
  filter(post_type != "RETWEET" & language != "English") %>%
  arrange(
    desc(publish_date)
    ) %>%
  mutate(political = if_else(account_category == "LeftTroll" | account_category == "RightTroll",
                             "Political", "Not Political"),
         string_length = str_length(tweet),
         string_length_trimmed = str_length(gsub(pattern = " ",
                replacement = "",
                tweet)
         )
         )
## # A tibble: 378 x 11
##       uid handle tweet language publish_date post_type account_category
##     <dbl> <chr>  <chr> <chr>    <chr>        <chr>     <chr>           
##  1 8.76e7 ANZGRI а эт… Russian  9/9/2017 15… QUOTE_TW… NonEnglish      
##  2 8.76e7 ANZGRI но е… Russian  9/8/2017 11… QUOTE_TW… NonEnglish      
##  3 8.76e7 ANZGRI откр… Russian  9/6/2017 7:… QUOTE_TW… NonEnglish      
##  4 8.76e7 ANZGRI Бля.… Russian  9/6/2015 13… QUOTE_TW… NonEnglish      
##  5 1.65e9 ANTON… Gove… Portugu… 9/4/2016 14… QUOTE_TW… LeftTroll       
##  6 1.67e9 ADAMC… Gove… Portugu… 9/4/2016 14… QUOTE_TW… LeftTroll       
##  7 8.76e7 ANZGRI типа… Ukraini… 9/4/2016 0:… QUOTE_TW… NonEnglish      
##  8 8.76e7 ANZGRI Стаб… Russian  9/30/2015 1… QUOTE_TW… NonEnglish      
##  9 8.76e7 ANZGRI http… Russian  9/3/2015 11… QUOTE_TW… NonEnglish      
## 10 8.76e7 ANZGRI как … Russian  9/29/2017 6… QUOTE_TW… NonEnglish      
## # … with 368 more rows, and 4 more variables: tweet_id <dbl>,
## #   political <chr>, string_length <int>, string_length_trimmed <int>

Note that we can continue building on the dplyr workflow used above; however, we will start over from here on out.

stringr: Can I combine or subset strings?

What if you want to modify a string? Combining strings is rather useful when you want to add features to a vector. For example, an original Twitter handle is usually prefixed with an @ symbol; however, in our data set this is not the case. In order to fix that, we can add the @ symbol using the str_c() function, which combines strings. For example:

df %>%
    select(author,
         content) %>%
  rename(handle = author,
         tweet = content) %>%
  mutate(string_length = str_length(tweet),
         string_length_trimmed = str_length(gsub(pattern = " ",
                replacement = "",
                tweet)
         ),
         handle = str_c("@", handle)
         )
## # A tibble: 243,891 x 4
##    handle  tweet                            string_length string_length_tr…
##    <chr>   <chr>                                    <int>             <int>
##  1 @10_GOP "\"We have a sitting Democrat U…           156               133
##  2 @10_GOP Marshawn Lynch arrives to game …           140               120
##  3 @10_GOP Daughter of fallen Navy Sailor …           143               125
##  4 @10_GOP JUST IN: President Trump dedica…           145               126
##  5 @10_GOP 19,000 RESPECTING our National …            83                77
##  6 @10_GOP "Dan Bongino: \"Nobody trolls l…            97                86
##  7 @10_GOP 🐝🐝🐝 https://t.co/MorL3AQW0z              27                26
##  8 @10_GOP '@SenatorMenendez @CarmenYulinC…           141               122
##  9 @10_GOP As much as I hate promoting CNN…           140               119
## 10 @10_GOP After the 'genocide' remark fro…           119               102
## # … with 243,881 more rows

Note that by using mutate() and str_c(), the string was modified in place. Additionally, we could remove that @ with the gsub() function, as shown below.
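A quick one-liner with base R’s gsub(), stripping the @ back off a handle:

gsub(pattern = "@", replacement = "", x = "@10_GOP")
## [1] "10_GOP"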

With that aside, let’s turn our attention to the second task in this sub-section: subsetting. The str_sub() function allows you to extract the part of a string between a starting and an ending position. Consider the following string:

hello world!

Here each character occupies a position along the length of the string, meaning h is in position 1, e is in position 2, the first l is in position 3, and so on. So if we wanted to extract the word hello from the overall string, we could use the positions of its first letter (h) and its last letter (o) as markers. Like so:

str_sub("hello world!", 1, 5)
## [1] "hello"

Alternatively, we could also extract the second word, as long as we know the positions of the first and last characters in the sequence:

str_sub("hello world!", 7, 11)
## [1] "world"

Let’s put this function to good use. Consider the following string:

df$publish_date[1]
## [1] "10/1/2017 19:58"

At what positions in the string do we find the month? Notice that we can use str_sub() to extract data of interest from recurring patterns. This is to say, in this example we know the month will always fill the first one or two characters, separated from the day by a /, and so on. Let’s extract the month using the workflow from before together with str_sub():

df %>%
    select(author,
         content,
         publish_date) %>%
  rename(handle = author,
         tweet = content) %>%
  mutate(month = gsub("/",
                      "",
                      str_sub(publish_date, 1, 2)))
## # A tibble: 243,891 x 4
##    handle tweet                                         publish_date  month
##    <chr>  <chr>                                         <chr>         <chr>
##  1 10_GOP "\"We have a sitting Democrat US Senator on … 10/1/2017 19… 10   
##  2 10_GOP Marshawn Lynch arrives to game in anti-Trump… 10/1/2017 22… 10   
##  3 10_GOP Daughter of fallen Navy Sailor delivers powe… 10/1/2017 22… 10   
##  4 10_GOP JUST IN: President Trump dedicates President… 10/1/2017 23… 10   
##  5 10_GOP 19,000 RESPECTING our National Anthem! #Stan… 10/1/2017 2:… 10   
##  6 10_GOP "Dan Bongino: \"Nobody trolls liberals bette… 10/1/2017 2:… 10   
##  7 10_GOP 🐝🐝🐝 https://t.co/MorL3AQW0z                10/1/2017 2:… 10   
##  8 10_GOP '@SenatorMenendez @CarmenYulinCruz Doesn't m… 10/1/2017 2:… 10   
##  9 10_GOP As much as I hate promoting CNN article, her… 10/1/2017 3:… 10   
## 10 10_GOP After the 'genocide' remark from San Juan Ma… 10/1/2017 3:… 10   
## # … with 243,881 more rows

stringr: How do I handle casing?

While stringr contains highly specialized functions for manipulating strings, it also contains fairly general-purpose functions that allow you to standardize and clean your data a bit. It is not uncommon for text data to arrive in mismatched case, for instance “Chris loves stringr”, “Chris loves Stringr”, or even “chris loves stringr”. Unfortunately, the computer is unable to see that those three strings represent the same idea. Let’s test this:

string1 <- "Chris loves stringr"
string2 <- "Chris loves Stringr"
string3 <- "chris loves stringr"

string1 == string2
## [1] FALSE
string2 == string3
## [1] FALSE
string1 == string3
## [1] FALSE

Luckily, the stringr library contains three functions for dealing with mismatching string case:

  • str_to_lower(): Transforms all words in a string to lower case
  • str_to_upper(): Transforms all words in a string to upper case
  • str_to_title(): Transforms all words in a string to title case

Say, for example, we wanted to standardize our three test strings to upper case and then test whether the objects are equal:

string1 <- str_to_upper(string1)
string2 <- str_to_upper(string2)
string3 <- str_to_upper(string3)

string1 == string2
## [1] TRUE
string2 == string3
## [1] TRUE
string1 == string3
## [1] TRUE

Now consider: how would you standardize your handles? Title, upper, or lower case? Regardless of how you choose to do so, case transformations on a vector are a sure way to reduce case-sensitivity errors:

df %>%
    select(author,
         content) %>%
  rename(handle = author,
         tweet = content) %>%
  mutate(handle = str_c("@", handle),
         handle = str_to_upper(handle))
## # A tibble: 243,891 x 2
##    handle  tweet                                                           
##    <chr>   <chr>                                                           
##  1 @10_GOP "\"We have a sitting Democrat US Senator on trial for corruptio…
##  2 @10_GOP Marshawn Lynch arrives to game in anti-Trump shirt. Judging by …
##  3 @10_GOP Daughter of fallen Navy Sailor delivers powerful monologue on a…
##  4 @10_GOP JUST IN: President Trump dedicates Presidents Cup golf tourname…
##  5 @10_GOP 19,000 RESPECTING our National Anthem! #StandForOurAnthem🇺🇸 htt…
##  6 @10_GOP "Dan Bongino: \"Nobody trolls liberals better than Donald Trump…
##  7 @10_GOP 🐝🐝🐝 https://t.co/MorL3AQW0z                                  
##  8 @10_GOP '@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN does…
##  9 @10_GOP As much as I hate promoting CNN article, here they are admittin…
## 10 @10_GOP After the 'genocide' remark from San Juan Mayor the narrative h…
## # … with 243,881 more rows

stringr: What is a regular expression? How do I handle it?

Regular expressions are arguably among the most useful tools for dealing with strings in R and other programming languages; however, many analysts shy away from them because they are among the most cryptic-looking pieces of code. Gaston Sanchez, author of “Handling Strings with R”, succinctly defines regular expressions:

A regular expression is a special text string for describing a certain amount of text. This “certain amount of text” receives the formal name of pattern. In other words, a regular expression is a set of symbols that describes a text pattern. More formally we say that a regular expression is a pattern that describes a set of strings... Because the term “regular expression” is rather long, most people use the word regex as a shortcut term.

In other words, regular expressions are coded patterns that describe patterns inside your text. Check out this cheat sheet for working with regular expressions. Before we dive into matching regular expressions with stringr, let’s revisit our data set and print out a few observations for dissection:

df$content[1:5]
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"                  
## [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  #BoycottNFL https://t.co/qDlFBGMeag"               
## [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"             
## [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"

What patterns do you observe? Let’s focus on the following two patterns:

  • Hashtags: These are strings prefixed by a pound sign (#).
  • Mentions: Strings prefixed by the at sign (@).

Because this data set was procured by querying Twitter, we know that some, if not most, tweets should contain hashtags. Let’s view where this character is present by enlisting the help of the str_view() function, which shows the first match for a specific pattern. Like so:

str_view(df$content[1:5], "#")

Suppose you wanted to see all hashtags in each string. To do so, use str_view_all(), which shows every match for a pattern:

str_view_all(df$content[1:5], "#")

As you can see, the function works as expected. However, it can only match the predefined, literal pattern. This is where regular expressions come in handy. The easiest one to wrap your head around is ., which matches any single character; placed after the #, it matches whatever character follows the pound sign:

str_view_all(df$content[1:5], "#.")

This is an improvement, as we are now able to detect both the pound sign and the one character adjacent to it. However, we are still unable to match full words. To do so, we can use the regular expression \\w, the character class for word characters, and add the + quantifier, which tells the function to match the preceding pattern one or more times.

str_view_all(df$content[1:5], "#\\w+")

Hopefully, you are now beginning to see the usefulness of regular expressions. If you still feel a bit hesitant about how to use them, that is normal; regular expressions take a while to master. Let’s practice some more by turning our attention to mentions. As previously noted, these are account handles prefixed by the @ sign. Let’s try matching them with our previous code, only slightly modified to search for @ instead of #:

str_view_all(df$content[1:5], "@\\w+")

Once again, our pattern works! Hopefully at this point you are sold on the idea that regular expressions are versatile and open up a universe of data manipulation possibilities. Let’s move on from simply matching regular expression patterns to slightly more complex tasks. In the next section, we will capitalize on these matches to form new variables and extract further insights from our data.

stringr: Can I detect if a pattern is present? If so, how do I extract it?

Now that you have some experience with pattern matching, let’s take a look at pattern detection. The str_detect() function tries to match a pattern and returns a logical value for each element of the input. For example, we can test whether or not a tweet contains any mentions.

str_detect(df$content[1:5], "@\\w+")
## [1]  TRUE FALSE FALSE FALSE FALSE
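Because str_detect() returns a logical vector, it slots naturally into dplyr’s filter(). A minimal sketch keeping only the tweets that mention at least one account:

df %>%
  rename(tweet = content) %>%
  filter(str_detect(tweet, "@\\w+"))  # keep rows whose tweet contains a mention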

Additionally, we can use the str_count() function to count the number of occurrences of a pattern in each string in a data set.

str_count(df$content[1:5], "@\\w+")
## [1] 1 0 0 0 0

Such a vector of counts can be stored as a new variable in the data set:

df %>%
    select(author,
         content) %>%
  rename(handle = author,
         tweet = content) %>%
  mutate(number_mentions = str_count(tweet,
                                     "@\\w+")
         )
## # A tibble: 243,891 x 3
##    handle tweet                                             number_mentions
##    <chr>  <chr>                                                       <int>
##  1 10_GOP "\"We have a sitting Democrat US Senator on tria…               1
##  2 10_GOP Marshawn Lynch arrives to game in anti-Trump shi…               0
##  3 10_GOP Daughter of fallen Navy Sailor delivers powerful…               0
##  4 10_GOP JUST IN: President Trump dedicates Presidents Cu…               0
##  5 10_GOP 19,000 RESPECTING our National Anthem! #StandFor…               0
##  6 10_GOP "Dan Bongino: \"Nobody trolls liberals better th…               0
##  7 10_GOP 🐝🐝🐝 https://t.co/MorL3AQW0z                                  0
##  8 10_GOP '@SenatorMenendez @CarmenYulinCruz Doesn't matte…               2
##  9 10_GOP As much as I hate promoting CNN article, here th…               0
## 10 10_GOP After the 'genocide' remark from San Juan Mayor …               1
## # … with 243,881 more rows

In addition to pattern detection, stringr provides various functions for extracting the actual text of a match: str_extract() and str_extract_all() will snip matching text from a string. For example, let’s assume we want to extract the mentions in order to gain a better understanding of which accounts may be the most retweeted within our sample. To do so, let’s use str_extract(), like so:

df %>%
    select(author,
         content) %>%
  rename(handle = author,
         tweet = content) %>%
  mutate(first_mention = str_extract(tweet,
                                     "@(\\w+)"),
         all_mentions  = str_extract_all(tweet,
                                         "@(\\w+)"))
## # A tibble: 243,891 x 4
##    handle tweet                                 first_mention  all_mentions
##    <chr>  <chr>                                 <chr>          <list>      
##  1 10_GOP "\"We have a sitting Democrat US Sen… @nedryun       <chr [1]>   
##  2 10_GOP Marshawn Lynch arrives to game in an… <NA>           <chr [0]>   
##  3 10_GOP Daughter of fallen Navy Sailor deliv… <NA>           <chr [0]>   
##  4 10_GOP JUST IN: President Trump dedicates P… <NA>           <chr [0]>   
##  5 10_GOP 19,000 RESPECTING our National Anthe… <NA>           <chr [0]>   
##  6 10_GOP "Dan Bongino: \"Nobody trolls libera… <NA>           <chr [0]>   
##  7 10_GOP 🐝🐝🐝 https://t.co/MorL3AQW0z        <NA>           <chr [0]>   
##  8 10_GOP '@SenatorMenendez @CarmenYulinCruz D… @SenatorMenen… <chr [2]>   
##  9 10_GOP As much as I hate promoting CNN arti… <NA>           <chr [0]>   
## 10 10_GOP After the 'genocide' remark from San… @CNN           <chr [1]>   
## # … with 243,881 more rows

Upon inspecting the newly created columns, you should notice that the first_mention variable is a character vector, as expected, containing the first match. Similarly, the all_mentions variable contains the returns from pattern matching; however, it is a list column, that is, a list contained in a data frame. Remember that a data frame is a list of vectors: while most variables are atomic vectors (containing a single type of data), a list is also a vector and as such can be used in a data frame as long as it matches all other variables in length. List columns pose a specific set of challenges, but they are really useful for storing nested information, and we will work with them in the tidyr section. For now, let’s focus on answering the following: which accounts are most commonly mentioned first across the corpus of tweets?

df %>%
  rename(tweet = content) %>%
  mutate(first_mention = str_extract(tweet, "@(\\w+)")) %>%
  group_by(first_mention) %>%
  summarise(most_popular = n()) %>%
  na.omit() %>% # Removes missing observations
  arrange(desc(most_popular))
## # A tibble: 15,339 x 2
##    first_mention      most_popular
##    <chr>                     <int>
##  1 @realDonaldTrump           1209
##  2 @midnight                  1029
##  3 @YouTube                    651
##  4 @HillaryClinton             380
##  5 @POTUS                      376
##  6 @rus_improvisation          300
##  7 @FoxNews                    204
##  8 @CNN                        146
##  9 @WorldOfHashtags            122
## 10 @deray                      109
## # … with 15,329 more rows

stringr: So What?

In this section we expanded on the workflow we began building in the dplyr subsection, this time with a focus on working with strings. Hopefully, this brief introduction to working with text in stringr has inspired you to put these data points to use.

Tidying data with tidyr

The last package we will explore in this tutorial is tidyr. Luckily, this will be a relatively short process, as this library contains one function relevant to this data set:²

Table 5: Summary of tidyr functions

| Function | Definition |
|----------|------------|
| unnest() | Make each element in a list column its own row |

tidyr: How do I handle list columns?

First, let’s create a situation where we will need this function, using both stringr and dplyr functions:

df %>%
  mutate(all_mentions  = str_extract_all(content,
                                         "@(\\w+)")) %>%
  select(author, tweet_id, all_mentions) %>%
  na.omit()
## # A tibble: 243,891 x 3
##    author tweet_id all_mentions
##    <chr>     <dbl> <list>      
##  1 10_GOP  9.15e17 <chr [1]>   
##  2 10_GOP  9.15e17 <chr [0]>   
##  3 10_GOP  9.15e17 <chr [0]>   
##  4 10_GOP  9.15e17 <chr [0]>   
##  5 10_GOP  9.14e17 <chr [0]>   
##  6 10_GOP  9.14e17 <chr [0]>   
##  7 10_GOP  9.14e17 <chr [0]>   
##  8 10_GOP  9.14e17 <chr [2]>   
##  9 10_GOP  9.14e17 <chr [0]>   
## 10 10_GOP  9.14e17 <chr [1]>   
## # … with 243,881 more rows

List columns are a great place to use tidyr. First, we can extract the values in the list column and make each element its own row with unnest():

df %>%
  mutate(all_mentions  = str_extract_all(content,
                                         "@(\\w+)")) %>%
  select(author, tweet_id, publish_date, all_mentions) %>%
  unnest()
## # A tibble: 60,340 x 4
##    author tweet_id publish_date     all_mentions    
##    <chr>     <dbl> <chr>            <chr>           
##  1 10_GOP  9.15e17 10/1/2017 19:58  @nedryun        
##  2 10_GOP  9.14e17 10/1/2017 2:52   @SenatorMenendez
##  3 10_GOP  9.14e17 10/1/2017 2:52   @CarmenYulinCruz
##  4 10_GOP  9.14e17 10/1/2017 3:51   @CNN            
##  5 10_GOP  9.14e17 10/1/2017 3:58   @CNN            
##  6 10_GOP  9.14e17 10/1/2017 4:11   @thehill        
##  7 10_GOP  9.18e17 10/10/2017 21:59 @MichelleObama  
##  8 10_GOP  9.18e17 10/10/2017 22:06 @MichelleObama  
##  9 10_GOP  9.18e17 10/10/2017 23:42 @FLOTUS         
## 10 10_GOP  9.18e17 10/11/2017 19:16 @Breaking911    
## # … with 60,330 more rows

Note that the second and third observations share the same publish_date, tweet_id, and author, yet their all_mentions values differ. One way to read this is: “10_GOP tweeted on October 1st at both SenatorMenendez and CarmenYulinCruz”. This format is known as long data, and it has many advantages. Hadley Wickham, the author of this package, highlights some crucial reasons why you may want your data in long form, among them:

  • Long data makes it easier to summarize information, and dplyr makes this easy to accomplish:
df %>%
  mutate(all_mentions  = str_extract_all(content,
                                         "@(\\w+)")) %>%
  select(author, tweet_id, publish_date, all_mentions) %>%
  unnest() %>%
  group_by(all_mentions) %>%
  summarize(count=n()) %>%
  arrange(desc(count))
## # A tibble: 25,272 x 2
##    all_mentions       count
##    <chr>              <int>
##  1 @realDonaldTrump    1741
##  2 @midnight           1063
##  3 @YouTube             752
##  4 @POTUS               646
##  5 @HillaryClinton      626
##  6 @FoxNews             377
##  7 @rus_improvisation   344
##  8 @CNN                 278
##  9 @deray               160
## 10 @WorldOfHashtags     151
## # … with 25,262 more rows
  • Long form is preferred by many other packages. The example immediately above highlights this; however, other forms of data analysis also work best with long-form data. For example, the social network analysis library igraph contains various functions for working with long-form relational data, as sketched below.
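A minimal sketch of that hand-off, assuming the igraph package is installed; the handle-to-mention pairs from the unnested data become a directed edge list:

library(igraph)

edges <- df %>%
  mutate(all_mentions = str_extract_all(content, "@(\\w+)")) %>%
  select(author, all_mentions) %>%
  unnest() %>%                              # one row per author-mention pair
  select(from = author, to = all_mentions)  # first two columns define the edges

g <- graph_from_data_frame(edges, directed = TRUE)  # who-mentions-whom network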

tidyr: So What?

tidyr is one of the most important packages for working with data frames. While dplyr serves to manipulate the content of a data frame, tidyr focuses on the shape and format of your data, enabling you to move your data between long and wide formats, or to leverage the power of nested values in list columns.

Parting Thoughts and Additional Resources

This document was intended to serve as a top-level guide introducing you to wrangling data with the tidyverse and the canopy of packages it encompasses. This means that many packages have not been covered in detail. However, hopefully you are now able to see that the tidyverse grammar is intuitive, and you might be inspired to dive deeper into this universe of libraries designed to work in concert.

Using these building blocks, you should now be able to begin exploring the tidyverse ecosystem on your own. Keep resources such as each package’s documentation, website, and cheat sheet in your back pocket as you go.

References

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.


  1. Note that collecting Twitter data directly from R is possible with the rtweet package.

  2. The gather() and spread() functions are also crucial, but did not fit into the objective of this document.