The goal of this lesson is to provide you with foundational skills for handling text data. By the end, you’ll be able to match, search, manipulate, and analyze text in more powerful ways and larger quantities than you could before.
Before we get started, let’s make sure our environment is ready for work and we have all the packages that we’ll use installed.
The following function isn’t part of the lesson plan, but we’ll be using it to `print()` `data.frame`s in a more human-readable way than printing to the console.
In programming, we usually refer to character sequences as strings. In R, strings are formally of type `character`, but we’ll typically refer to them as strings anyway.
To make strings, we wrap text inside of either double or single quotation marks, like so…
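The calls that produced the output below aren’t shown in this extract; they would look like this:

```r
double_quoted <- "This is how we make a string with double quotes."
single_quoted <- 'This is how we make a string with single quotes.'

double_quoted
single_quoted
```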
## [1] "This is how we make a string with double quotes."
## [1] "This is how we make a string with single quotes."
The easiest way to include quotation marks inside of a string is to simply use different quotes to wrap our text, like so…
inner_double_quotes <- 'This is how we make a string that CONTAINS "DOUBLE QUOTES".'
inner_double_quotes
## [1] "This is how we make a string that CONTAINS \"DOUBLE QUOTES\"."
inner_single_quotes <- "This is how we make a string that CONTAINS 'SINGLE QUOTES'."
inner_single_quotes
## [1] "This is how we make a string that CONTAINS 'SINGLE QUOTES'."
Line breaks, such as those created when you press the Enter or Return key, are represented by `"\n"`.
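The code that created the string below isn’t shown in this extract; typing a string across several lines (pressing Enter inside the quotes) would produce it:

```r
multiline_string <- "
This string
has line
breaks.
"
multiline_string
```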
## [1] "\nThis string\nhas line\nbreaks.\n"
As you can see, the line breaks are replaced with `"\n"`. If we were to directly insert `"\n"` instead of pressing Enter, it would look like so…
If we want to `print()` a string so that the text is rendered in a human-readable format, we can concatenate and print the sequence using `cat()`.
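The call that produced the output below isn’t shown; it would be:

```r
cat("\nThis string\nhas line\nbreaks.\n")
```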
##
## This string
## has line
## breaks.
Similarly, tabs are represented by `"\t"`.
## This string has tabs.
Before we move on, let’s create a `character` `vector` named `practice_tweets` that we’ll use for the next exercises.
practice_tweets <- c(
"New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k",
"Denmark to expand anti-ISIS military mission with 400 Elite Soldiers.... https://t.co/ugFmV322s0",
"Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah",
"#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants.",
"Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side.",
"RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT"
)
The `\` used in `"\n"` and `"\t"` is a backslash. Backslashes are used to escape characters, which is how we can trigger alternative interpretation of subsequent characters. The concept of escape characters and other special characters gives us powerful ways to manipulate text data using Regular Expressions, or RegEx. We’ll harness the power of regular expressions using the `{stringr}` package, which loads when we call `library(tidyverse)`.
We use regular expressions to match character sequences. Using `stringr::str_view()`, we can explore exactly how text is matched.
The simplest `pattern`s to match are those that can be interpreted literally. By literal, I mean what you see is what you get.
For example, let’s say we want to find all the occurrences of `"ISIS"` in `practice_tweets`. Our `pattern` will simply be `"ISIS"`.
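The rendered `str_view()` widget doesn’t survive in this extract, but the call would look like this (assuming `{stringr}` is loaded via the tidyverse):

```r
library(stringr)

# highlights the first "ISIS" in each tweet
str_view(string = practice_tweets, pattern = "ISIS")
```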
We can use literal `pattern`s to match letters, numbers, and many other characters, such as `"#"` and `"/"`.
Notice that `str_view()` only matches the first occurrence of our `pattern`s. In order to match all occurrences, `{stringr}` uses a convention where the function name is suffixed with `_all()`, like so…
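The rendered widget is missing here; the call would be:

```r
# highlights every "ISIS" in each tweet, not just the first
str_view_all(string = practice_tweets, pattern = "ISIS")
```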
If we want to match multiple potential characters, we can separate individual `pattern`s with `"|"`, pronounced “or”.
For example, if we want to match all occurrences of either `"#"` or `"/"`, we can use `"#|/"`, like so…
We can also enclose optional individual character patterns inside square-brackets for the same effect, like so…
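The rendered output for both variants is missing from this extract; the calls would be:

```r
# either "#" or "/", written with "|"...
str_view_all(string = practice_tweets, pattern = "#|/")

# ...or equivalently with square-brackets
str_view_all(string = practice_tweets, pattern = "[#/]")
```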
There are many sequences that are used often enough that they have special, built-in patterns. The ones we’ll use most often are to capture digits, punctuation, and letters.
If we want to match any number, we use the pattern `"[0-9]"`.
If we want to match any punctuation, we use the pattern `"[:punct:]"`.
If we want to match any letter, we use the pattern `"[A-Za-z]"`. (You’ll sometimes see the shortcut `"[A-z]"`, but beware: it also matches the ASCII characters that sit between `Z` and `a`, such as `[`, `\`, and `_`.)
If we want to match only lower-case letters or only upper-case letters, we use `"[a-z]"` or `"[A-Z]"`, respectively, like so…
We’ve been using `str_view_all()` in order to match all occurrences of the `pattern`s we provide, but this results in multiple, separate matches. Note that the outputs above contain individual boxes for each matched character.

However, we typically want to match patterns that contain multiple characters, or even an unknown number of characters, for which we need to use some special characters.
You may have noticed that when we try to match the pattern `"ISIS"`, we don’t match `"IS"`.
If we think of `"ISIS"` as two back-to-back occurrences of the pattern `"IS"`, then we can solve this problem by matching `"IS"` if it occurs once or more.

We can do this by enclosing `"IS"` inside `"()"`, which we refer to as a group. We then append `"+"` at the end of the group, making `"(IS)+"`, which we can then use to match `"IS"` or `"ISIS"`.
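The rendered match is missing here; the call would be:

```r
# matches "IS" and "ISIS"
str_view_all(string = practice_tweets, pattern = "(IS)+")
```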
However, we still have a problem: we’re still not matching `"ISIL"`.
If you recall, we were able to match `"#"` or `"/"` by using either square-brackets or `"|"`. We can use the same strategy here to match either `"S"` or `"L"`, like so…
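The rendered match is missing here; using the square-bracket form (the same class the lesson reuses later in `tagger_regex`), the call would be:

```r
# "[SL]" matches either "S" or "L", so this matches "IS", "ISIS", and "ISIL"
str_view_all(string = practice_tweets, pattern = "(I[SL])+")
```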
Now our pattern matches all occurrences of `"IS"`, `"ISIS"`, or `"ISIL"`.
If we want to match the boundary of a word, we use `\\b`, and if we want to match any word character (`[A-Za-z0-9_]`), we use `\\w`.
With that in mind, let’s create a `basic_url_regex` variable and a `basic_hashtag_regex` variable.
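The definitions aren’t visible in this extract. Based on the combined pattern printed later (`"\\bhttp.*\\b|[:punct:]"`) and the hashtags matched in the tables below, they were likely along these lines (the hashtag pattern in particular is an assumption):

```r
basic_url_regex     <- "\\bhttp.*\\b" # from "http" to the last word boundary
basic_hashtag_regex <- "#\\w+"        # "#" followed by word characters (assumed)
```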
Since we usually use `data.frame`s to manage our data in R, let’s also create a `tibble` `data.frame` with our `practice_tweets` in a column named `text`.
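The construction isn’t shown in this extract; a minimal version using `tibble()` from the tidyverse:

```r
library(tibble)

practice_tweets_df <- tibble(text = practice_tweets)
```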
text |
---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT |
If we want to test whether a `pattern` is matched, we can get `TRUE` or `FALSE` using `str_detect()`.

Let’s add a column to `practice_tweets_df` called `has_url`.
practice_tweets_df %>%
mutate(has_url = str_detect(string = text,
pattern = basic_url_regex)) %>%
prettify()
text | has_url |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | TRUE |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | TRUE |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | FALSE |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | FALSE |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | FALSE |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | TRUE |
If we want to extract a `pattern` from a string, we use `str_extract()`.

Let’s add a column named `url` that contains the result of `str_extract()`.
practice_tweets_df %>%
mutate(url = str_extract(string = text,
pattern = basic_url_regex)) %>%
prettify()
text | url |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | https://t.co/4ZsnI05o2k |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | https://t.co/ugFmV322s0 |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | NA |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | NA |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | NA |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | https://t.co/dNWvurivFT |
`str_extract()` only returns the first sequence that matches `pattern`. If we want to extract all matches, we use `str_extract_all()`.

Let’s add a column named `hashtags` using `str_extract_all()`.
practice_tweets_df %>%
mutate(hashtags = str_extract_all(string = text,
pattern = basic_hashtag_regex)) %>%
prettify()
text | hashtags |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | #ISIS |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | character(0) |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | c(“#Iraqi”, “#Fallujah”) |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | #Iraqi |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | character(0) |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | c(“#ISIS”, “#Raqqa”, “#Syria”, “#ISIL”) |
Since `str_extract_all()` is designed to return multiple matches, it returns a `list` by default. Note that `list` columns are different from most columns, which are usually atomic `vector`s, and they require different approaches to handle.
practice_tweets_df %>%
mutate(hashtags = str_extract_all(string = text,
pattern = basic_hashtag_regex))
## # A tibble: 6 x 2
## text hashtags
## <chr> <list>
## 1 New #ISIS media release to be on air soon. https://t.co/4ZsnI05~ <chr [1~
## 2 Denmark to expand anti-ISIS military mission with 400 Elite Sol~ <chr [0~
## 3 Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah <chr [2~
## 4 #Iraqi Army claims: Repelled IS assault in Northern Baiji and k~ <chr [1~
## 5 Nusra used to have a 5 to 7 km frontline against IS in North-Al~ <chr [0~
## 6 RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raq~ <chr [4~
If we want to remove matched patterns, we use `str_remove()` and `str_remove_all()`.
practice_tweets_df %>%
mutate(no_url = str_remove_all(string = text,
pattern = basic_url_regex)) %>%
prettify()
text | no_url |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | New #ISIS media release to be on air soon. |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | #Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL |
practice_tweets_df %>%
mutate(no_punct = str_remove_all(string = text,
pattern = "[:punct:]")) %>%
prettify()
text | no_punct |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | New ISIS media release to be on air soon httpstco4ZsnI05o2k |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | Denmark to expand antiISIS military mission with 400 Elite Soldiers httpstcougFmV322s0 |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | Nearly 40 Iraqi Soldiers killed by ISIS in NE Fallujah |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | Iraqi Army claims Repelled IS assault in Northern Baiji and killed 62 Militants |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | Nusra used to have a 5 to 7 km frontline against IS in NorthAleppo IS could never advance from that side |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | RT Raqqasl1 ISIS cover books for the 3 Third grade 3 Raqqa Syria ISIL httpstcodNWvurivFT |
If we want to combine separate strings into a single `character` `vector`, we use `str_c()` (or `paste0()`).

We can provide an argument to `sep=` indicating a character we’d like to insert between each of the strings we provide.
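The definition of `url_or_punct_regex` isn’t shown in this extract; given the output printed below, it would have been created like so (assuming the `basic_url_regex` from earlier):

```r
url_or_punct_regex <- str_c(basic_url_regex, "[:punct:]", sep = "|")
url_or_punct_regex
```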
## [1] "\\bhttp.*\\b|[:punct:]"
practice_tweets_df %>%
mutate(no_urls_or_punct = str_remove_all(string = text,
pattern = url_or_punct_regex)) %>%
prettify()
text | no_urls_or_punct |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | New ISIS media release to be on air soon |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | Denmark to expand antiISIS military mission with 400 Elite Soldiers |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | Nearly 40 Iraqi Soldiers killed by ISIS in NE Fallujah |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | Iraqi Army claims Repelled IS assault in Northern Baiji and killed 62 Militants |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | Nusra used to have a 5 to 7 km frontline against IS in NorthAleppo IS could never advance from that side |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | RT Raqqasl1 ISIS cover books for the 3 Third grade 3 Raqqa Syria ISIL |
Whitespace inside of strings is something we’ll need to handle often. To demonstrate, we’ll add some more columns to `practice_tweets_df` and assign the result to a variable named `excess_whitespace_df`.
1. Start with `practice_tweets_df`, then…
2. `rename()` the `text` column to `original_text`
3. `mutate()` to add a column named `excess_whitespace`, built by `str_replace_all()`…
4. …inserting `" "` at the start, end, and anywhere there’s already a space in `original_text`
5. `filter()` rows, only keeping those where `excess_whitespace` has fewer than 100 characters
6. Convert `excess_whitespace_df` to a `list` with `as.list()`, `print()`ing it in a way that our browsers won’t remove the excess whitespace

excess_whitespace_df <- practice_tweets_df %>% # Step 1.
rename(original_text = text) %>% # 2.
mutate(excess_whitespace = str_replace_all(original_text, # 3.
"^|$|\\s", " ")) %>% # 4.
filter(nchar(excess_whitespace) < 100) # 5.
as.list(excess_whitespace_df) # 6.
## $original_text
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"
##
## $excess_whitespace
## [1] " New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k "
## [2] " Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah "
Now let’s add more columns so we can see the effects of various techniques to remove `excess_whitespace`.
1. Start with `excess_whitespace_df`, then…
2. `mutate()` to add columns that…
   - `str_squish()` the text to remove excess whitespace inside the string
   - `str_trim()` the text to remove whitespace at the start and end of the string
   - both `str_trim()` and `str_squish()` the string
   - `str_remove_all()` with `pattern = "\\s+"`, removing all whitespace anywhere in the string
3. Convert the result to a `list` with `as.list()`, `print()`ing it in a way that our browsers won’t remove the excess whitespace.

excess_whitespace_df %>% # Step 1.
mutate(squished = str_squish(string = excess_whitespace), # 2a.
trimmed = str_trim(string = excess_whitespace), # 2b.
squished_and_trimed = str_trim(str_squish(excess_whitespace)), # 2c.
no_whitespace = str_remove_all(string = excess_whitespace, # 2d.
pattern = "\\s+")) %>%
as.list() # 3.
## $original_text
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"
##
## $excess_whitespace
## [1] " New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k "
## [2] " Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah "
##
## $squished
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"
##
## $trimmed
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"
##
## $squished_and_trimed
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"
##
## $no_whitespace
## [1] "New#ISISmediareleasetobeonairsoon.https://t.co/4ZsnI05o2k"
## [2] "Nearly40#IraqiSoldierskilledbyISISinNE#Fallujah"
Since you just saw `str_replace_all()`, let’s see some more examples of how we can replace character sequences in bulk.
practice_tweets_df %>%
mutate(urls_replaced = str_replace_all(string = text,
pattern = basic_url_regex,
replacement = "{{THIS WAS A URL}}")) %>%
prettify()
text | urls_replaced |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | New #ISIS media release to be on air soon. {{THIS WAS A URL}} |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. {{THIS WAS A URL}} |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | #Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL {{THIS WAS A URL}} |
When dealing with many complicated RegEx patterns, it’s often easier to set up a single variable to `str_replace_all()` multiple `pattern`s simultaneously.

We can do this using a named `vector` where the names are the `pattern`s we want to match and the values are their `replacement`s.

Here’s one called `tagger_regex`.
tagger_regex <- c(
c("(I[SL])+|[Ii]slamic.?[Ss]tate" = "{{ISIS}}"), # tag ISIS
c("#(\\w|_)+\\b" = "{{HASHTAG}}"), # tag hashtags
c("@(\\w|_)+\\b" = "{{SCREEN_NAME}}"), # tag screen names
c("\\bhttp.*?(\\s|$)" = "{{URL}}"), # tag URLs
c("@|#" = "") # remove @ and #
)
tagger_regex
## (I[SL])+|[Ii]slamic.?[Ss]tate #(\\w|_)+\\b
## "{{ISIS}}" "{{HASHTAG}}"
## @(\\w|_)+\\b \\bhttp.*?(\\s|$)
## "{{SCREEN_NAME}}" "{{URL}}"
## @|#
## ""
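The two-column view that follows isn’t produced by code shown in this extract; a sketch that yields the same shape uses `tibble::enframe()`:

```r
tibble::enframe(tagger_regex, name = "pattern", value = "replacement")
```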
## # A tibble: 5 x 2
## pattern replacement
## <chr> <chr>
## 1 (I[SL])+|[Ii]slamic.?[Ss]tate {{ISIS}}
## 2 "#(\\w|_)+\\b" {{HASHTAG}}
## 3 "@(\\w|_)+\\b" {{SCREEN_NAME}}
## 4 "\\bhttp.*?(\\s|$)" {{URL}}
## 5 @|# ""
Using `tagger_regex`, let’s standardize references to ISIS, hashtags, screen names, and URLs, while also dropping `@` and `#`.
practice_tweets_df %>%
mutate(standardized = str_replace_all(string = text,
pattern = tagger_regex)) %>%
prettify()
text | standardized |
---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | New {{ISIS}} media release to be on air soon. {{URL}} |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | Denmark to expand anti-{{ISIS}} military mission with 400 Elite Soldiers…. {{URL}} |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | Nearly 40 {{HASHTAG}} Soldiers killed by {{ISIS}} in NE {{HASHTAG}} |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | {{HASHTAG}} Army claims: Repelled {{ISIS}} assault in Northern Baiji and killed 62 Militants. |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | Nusra used to have a 5 to 7 km frontline against {{ISIS}} in North-Aleppo. {{ISIS}} could never advance from that side. |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | RT {{SCREEN_NAME}}: {{ISIS}} cover books for the 3 Third grade (3) {{HASHTAG}} {{HASHTAG}} {{ISIS}} {{URL}} |
Let’s now see how `str_count()` works.

First, we’ll turn `tagger_regex` into a `list` where each RegEx is its own element, then we’ll `set_names()` so we can access each RegEx conveniently with `$`.
counter_regex <- tagger_regex %>%
names() %>%
as.list() %>%
set_names(c("isis", "hashtag", "screen_name", "url", "mention_hashtag_start"))
counter_regex
## $isis
## [1] "(I[SL])+|[Ii]slamic.?[Ss]tate"
##
## $hashtag
## [1] "#(\\w|_)+\\b"
##
## $screen_name
## [1] "@(\\w|_)+\\b"
##
## $url
## [1] "\\bhttp.*?(\\s|$)"
##
## $mention_hashtag_start
## [1] "@|#"
Now we’ll use `counter_regex` to count up the number of matches for each `pattern` and place the results in new columns.
practice_tweets_df %>%
mutate(n_screen_names = str_count(string = text,
pattern = counter_regex$screen_name),
n_hashtags = str_count(string = text,
pattern = counter_regex$hashtag),
n_urls = str_count(string = text,
pattern = counter_regex$url),
n_isis_mentions = str_count(string = text,
pattern = counter_regex$isis)) %>%
prettify()
text | n_screen_names | n_hashtags | n_urls | n_isis_mentions |
---|---|---|---|---|
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k | 0 | 1 | 1 | 1 |
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 | 0 | 0 | 1 | 1 |
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah | 0 | 2 | 0 | 1 |
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. | 0 | 1 | 0 | 1 |
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. | 0 | 0 | 0 | 2 |
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT | 1 | 4 | 1 | 2 |
Fifth Tribe is a DC-based digital agency that scraped 17,410 tweets from pro-ISIS accounts following the November 2015 Paris Attacks. The scraped data begin in January 2015 and end in May 2016.
Fifth Tribe submitted the data to Kaggle, an online data science community that allows users to find and publish data sets and enter competitions to solve data science challenges.
The data are available on Kaggle, as well as in the GitHub repository ababen/How-Isis-Uses-Twitter.
You can obtain the data either way, but here’s a convenience function that reads the CSV file directly from the above GitHub repository.
get_isis_fanboys_data <- function() {
readr::read_csv(
"https://raw.githubusercontent.com/ababen/How-Isis-Uses-Twitter/master/tweets.csv"
)
}
init_isis_tweets <- get_isis_fanboys_data()
init_isis_tweets %>%
glimpse()
## Observations: 17,410
## Variables: 8
## $ name <chr> "GunsandCoffee", "GunsandCoffee", "GunsandCoffe...
## $ username <chr> "GunsandCoffee70", "GunsandCoffee70", "GunsandC...
## $ description <chr> "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews",...
## $ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ followers <dbl> 640, 640, 640, 640, 640, 640, 640, 640, 640, 64...
## $ numberstatuses <dbl> 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49,...
## $ time <chr> "1/6/2015 21:07", "1/6/2015 21:27", "1/6/2015 2...
## $ tweets <chr> "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU...
Fortunately, the data don’t require extensive cleaning, but there are two things we want to address immediately.
The `time` column is of type `character`, so we need to convert it to a proper date-time format to actually use it in our analysis.

1. Start with `init_isis_tweets`, then…
2. `select()` the following columns:
   - `time`, but rename it to `created_at`
   - `username`, but rename it to `screen_name`
   - `numberstatuses`, but rename it to `status_count`
   - `tweets`, but rename it to `text`
   - `followers`, but rename it to `followers_count`
   - `everything()` (any remaining columns)
3. `mutate()` `created_at` so that it’s a standard date-time data type:
   - use `as.POSIXct()` to convert `created_at` to type `POSIXct`, a data type that represents time using the UNIX Epoch
   - provide `format=`, which tells R where to find the:
     - `%m` (1 or 2 digit month)
     - `%d` (1 or 2 digit day)
     - `%Y` (4 digit year)
     - `%H` (1 or 2 digit hour)
     - `%M` (1 or 2 digit minute)

isis_tweets <- init_isis_tweets %>% # Step 1.
select(created_at = time, # 2a.
screen_name = username, # 2b.
status_count = numberstatuses, # 2c.
text = tweets, # 2d.
followers_count = followers, # 2e.
everything()) %>% # 2f.
mutate(created_at = as.POSIXct(created_at, # 3.
format = "%m/%d/%Y %H:%M"))
isis_tweets %>%
glimpse()
## Observations: 17,410
## Variables: 8
## $ created_at <dttm> 2015-01-06 21:07:00, 2015-01-06 21:27:00, 201...
## $ screen_name <chr> "GunsandCoffee70", "GunsandCoffee70", "Gunsand...
## $ status_count <dbl> 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49...
## $ text <chr> "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHF...
## $ followers_count <dbl> 640, 640, 640, 640, 640, 640, 640, 640, 640, 6...
## $ name <chr> "GunsandCoffee", "GunsandCoffee", "GunsandCoff...
## $ description <chr> "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews"...
## $ location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
For simplicity’s sake, we’ll split our data into two variables: `isis_statuses` and `isis_users`.

`isis_statuses` only keeps `distinct()` rows from the `created_at`, `screen_name`, and `text` columns.
isis_statuses <- isis_tweets %>%
distinct(created_at, screen_name, text)
isis_statuses %>%
glimpse()
## Observations: 17,410
## Variables: 3
## $ created_at <dttm> 2015-01-06 21:07:00, 2015-01-06 21:27:00, 2015-01...
## $ screen_name <chr> "GunsandCoffee70", "GunsandCoffee70", "GunsandCoff...
## $ text <chr> "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFUL I...
`isis_users` only keeps `distinct()` rows from the `screen_name`, `name`, `description`, `status_count`, and `followers_count` columns.
isis_users <- isis_tweets %>%
distinct(screen_name, name, description, status_count, followers_count)
isis_users %>%
glimpse()
## Observations: 325
## Variables: 5
## $ screen_name <chr> "GunsandCoffee70", "AbuLaythAlHindi", "YazeedD...
## $ name <chr> "GunsandCoffee", "Abu Layth Al Hindi", "ابو ال...
## $ description <chr> "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews"...
## $ status_count <dbl> 49, 18, 127, 273, 471, 274, 273, 127, 797, 798...
## $ followers_count <dbl> 640, 68, 904, 112, 25, 119, 119, 823, 324, 328...
Since the data don’t provide us many of the entities that Twitter actually provides, we need to manually extract the hashtags used (`hashtags`), screen names mentioned (`mentions_screen_name`), and URLs shared (`urls_url`).

With that in mind, let’s write a set of `extract_all_*()` functions.
In order to best enforce correctness, the RegEx `pattern`s we’ll use here are more complicated than those we used earlier. Unfortunately, working with text with loose format requirements and multiple languages gets extremely complicated, and the nuances of character encoding are well beyond the scope of this lesson. However, since we want our analysis to be correct, the `pattern`s we’ll use for the rest of the lesson will be more robust to real-world data.
First, we’ll write a function to `extract_all_hashtags()`…
extract_all_hashtags <- function(tweet) {
str_extract_all(string = tweet, pattern = "(?<=#|#)\\w+")
}
Next, we’ll write a function to `extract_all_mentions()` using similar code…
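The body of `extract_all_mentions()` isn’t visible in this extract. Mirroring `extract_all_hashtags()`, a sketch might be (the exact pattern is an assumption):

```r
extract_all_mentions <- function(tweet) {
  # assumed pattern: a lookbehind for "@", then one or more word characters
  str_extract_all(string = tweet, pattern = "(?<=@)\\w+")
}
```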
Then we’ll write a function to `extract_all_urls()`…
extract_all_urls <- function(tweet) {
str_extract_all(
string = tweet,
pattern = "(https?://)?(?:www\\.)?[\\w\\d\\-_]+\\.\\w{2,3}(\\.\\w{2})?(/(?<=/)(?:[\\w\\d\\-./_]+)?)?"
)
}
Here’s a function to test whether or not a tweet is a retweet…
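The function body isn’t shown in this extract. Since it’s later applied to the lowercased `prepped_text`, a sketch might be (the pattern is an assumption):

```r
is_retweet <- function(tweet) {
  # assumed convention: retweets begin with "rt @" after lowercasing
  str_detect(string = tweet, pattern = "^rt @")
}
```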
Last, we’ll write a function to tell us whether or not a tweet `contains_arabic_script()`…
contains_arabic_script <- function(tweet) {
str_detect(string = tweet, pattern = "[\u0621-\u064A\u0660-\u0669]")
}
The `pattern=` argument (`"[\u0621-\u064A\u0660-\u0669]"`) is a character class spanning the Unicode sequences corresponding to Arabic characters. R, along with many other older programming languages, doesn’t handle non-Latin letters very well, and the right-to-left cursive style of Arabic doesn’t always play nice with R, especially on Windows.

That said, Unicode sequences work just fine under the hood…
## ء آ أ ي غ
Using our helper functions, let’s enrich our data.

1. Start with `isis_statuses`, then…
2. `mutate()` to add the columns…
   - `is_translated`
   - `prepped_text`: remove translation tags, trim whitespace, cast to lowercase
3. `mutate()` to add columns using…
   - `extract_all_hashtags()`
   - `extract_all_mentions()`
   - `extract_all_urls()`
4. `mutate()` to annotate whether each tweet…
   - `has_hashtags`
   - `has_mentions`
   - `has_urls`
   - Since `hashtags`, `mentions_screen_name`, and `urls_url` can have multiple values, they are `list` columns, so we use `map_lgl()` to iterate over each row, checking whether or not it’s empty
5. `mutate()` to add the columns…
   - `is_retweet`
   - `contains_arabic_script`

augmented_statuses <- isis_statuses %>% # Step 1.
mutate(is_translated = str_detect(text, # 2a.
"^ENGLISH TRANS(LATION|CRIPT)"),
prepped_text = text %>% # 2b.
str_remove("^ENGLISH TRANS(LATION|CRIPT) ?[:-]") %>%
str_trim() %>%
str_to_lower()
) %>%
mutate(hashtags = extract_all_hashtags(prepped_text), # 3a.
mentions_screen_name = extract_all_mentions(prepped_text), # 3b.
urls_url = extract_all_urls(prepped_text)) %>% # 3c.
mutate(has_hashtags = map_lgl(hashtags, ~ length(.x) > 0), # 4a.
has_mentions = map_lgl(mentions_screen_name, ~ length(.x) > 0), # 4b.
has_urls = map_lgl(urls_url, ~ length(.x) > 0)) %>% # 4c.
mutate(is_retweet = is_retweet(prepped_text), # 5a.
contains_arabic_script = contains_arabic_script(prepped_text)) # 5b.
Now we can use the columns of type `logical` (`has_hashtags`, `has_urls`, `has_mentions`, `is_retweet`, `contains_arabic_script`) to easily filter our results.
augmented_statuses %>%
filter(has_hashtags, has_urls, has_mentions, is_retweet, contains_arabic_script) %>%
select_if(negate(is.logical)) %>%
glimpse()
## Observations: 64
## Variables: 7
## $ created_at <dttm> 2015-09-09 21:59:00, 2015-09-15 15:20:00...
## $ screen_name <chr> "abubakerdimshqi", "abubakerdimshqi", "ab...
## $ text <chr> "RT @__IslamReligion: Why Do Muslims Eat ...
## $ prepped_text <chr> "rt @__islamreligion: why do muslims eat ...
## $ hashtags <list> [<"usa", "uk", "sasummer", "rt">, <"جيش_...
## $ mentions_screen_name <list> ["__islamreligion", "freealeppo1985", "o...
## $ urls_url <list> ["http://t.co/ceetypwwmq", "http://t.co/...
Using `augmented_statuses`, we can get started on our actual analysis.
country_regex_df <- countrycode::codelist %>% # Step 1.
as_tibble() %>% # 2.
select(country_name = country.name.en, # 3a.
regex = country.name.en.regex) %>% # 3b.
filter(!country_name %in% countries_to_skip) %>% # 4.
mutate(regex = paste0("(?<=\\b)(", regex, ")(?=\\b)")) # 5.
country_regex_df
## # A tibble: 279 x 2
## country_name regex
## <chr> <chr>
## 1 Afghanistan "(?<=\\b)(afghan)(?=\\b)"
## 2 <U+00C5>land Islands "(?<=\\b)(<U+00E5>land)(?=\\b)"
## 3 Albania "(?<=\\b)(albania)(?=\\b)"
## 4 Algeria "(?<=\\b)(algeria)(?=\\b)"
## 5 American Samoa "(?<=\\b)(^(?=.*americ).*samoa)(?=\\b)"
## 6 Andorra "(?<=\\b)(andorra)(?=\\b)"
## 7 Angola "(?<=\\b)(angola)(?=\\b)"
## 8 Anguilla "(?<=\\b)(anguill?a)(?=\\b)"
## 9 Antarctica "(?<=\\b)(antarctica)(?=\\b)"
## 10 Antigua & Barbuda "(?<=\\b)(antigua)(?=\\b)"
## # ... with 269 more rows
Next, we’re going to perform a fuzzy left-join, but what’s a left-join anyway?
A left-join is used to combine tables (`data.frame`s in R) so that all rows on the left-hand side are kept, but the values matched in mutual column(s) on the right-hand side are added.
Let’s do a quick example. Here are two `data.frame`s.
lhs <- tribble( # # A tibble: 3 x 2
~key, ~val_lhs, # key val_lhs
"a", 1, # <chr> <dbl>
"b", 2, # 1 a 1
"c", 3 # 2 b 2
) # 3 c 3
rhs <- tribble( # # A tibble: 3 x 2
~key, ~val_rhs, # key val_rhs
"a", 3, # <chr> <dbl>
"b", 4, # 1 a 3
"d", 5 # 2 b 4
) # 3 d 5
If we want to add the values of rhs (right-hand side) to lhs (left-hand side) wherever the keys match, we perform a left_join().
lhs %>% left_join(rhs)
## # A tibble: 3 x 3
## key val_lhs val_rhs
## <chr> <dbl> <dbl>
## 1 a 1 3
## 2 b 2 4
## 3 c 3 NA
By default, left_join() will match on all the columns in both data.frames that have the same name, but in practice we should always eliminate ambiguity (and thus unsafety) by specifying which columns to match, providing an argument to left_join()’s by= parameter.
lhs %>% left_join(rhs, by = "key")
## # A tibble: 3 x 3
## key val_lhs val_rhs
## <chr> <dbl> <dbl>
## 1 a 1 3
## 2 b 2 4
## 3 c 3 NA
If the column names don’t match, we again provide an argument to by=, but we clarify which columns to match using a named vector.
lhs <- lhs %>% rename(key_lhs = key)
rhs <- rhs %>% rename(key_rhs = key)
lhs %>%
left_join(rhs, by = c("key_lhs" = "key_rhs"))
## # A tibble: 3 x 3
## key_lhs val_lhs val_rhs
## <chr> <dbl> <dbl>
## 1 a 1 3
## 2 b 2 4
## 3 c 3 NA
If you hadn’t guessed, left_join() is just one kind of table join. We won’t discuss them in this lesson, but you can also use right_join(), inner_join(), full_join(), anti_join(), and more.
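To get a quick feel for how those joins differ, here's a self-contained sketch that rebuilds the lhs and rhs tables from above (before the renaming) and comments on which rows each join keeps:

```r
library(dplyr)

lhs <- tribble(~key, ~val_lhs, "a", 1, "b", 2, "c", 3)
rhs <- tribble(~key, ~val_rhs, "a", 3, "b", 4, "d", 5)

inner_join(lhs, rhs, by = "key") # keeps only keys found in both: a, b
full_join(lhs, rhs, by = "key")  # keeps keys from either side: a, b, c, d
anti_join(lhs, rhs, by = "key")  # keeps lhs rows with no match in rhs: c
```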
Table joins are typically done to match exact values, but we can use the {fuzzyjoin} package to match inexact values, including regular expressions. We’ll do just that to create a new variable named tagged_countries, adding the country_name column from country_regex_df to augmented_statuses. Essentially, we’re just adding a column to augmented_statuses noting which country is mentioned, if any.
Here’s the plan:

1. Start with augmented_statuses.
2. regex_left_join() it with country_regex_df, specifying the by= argument using a named vector: the name ("prepped_text") is the column on the left-hand side (in augmented_statuses) with which we want to match the value ("regex"), which is the column on the right-hand side (in country_regex_df).

tagged_countries <- augmented_statuses %>% # Step 1.
regex_left_join(country_regex_df, # 2a.
by = c("prepped_text" = "regex")) # 2b.
tagged_countries %>%
drop_na(country_name) %>%
select(prepped_text, country_name) %>%
prettify()
| prepped_text | country_name |
|---|---|
| : ’a message to the truthful in syria - sheikh abu muhammed al maqdisi: http://t.co/73xfszsjvr http://t.co/x8bzcscxzq | Syria |
| new link, after previous one taken down:aqap-‘the faces have been brightened’ -regarding the blessed attack in france http://t.co/ralsnpd547 | France |
| #breaking #confirmed islamic state takes control of al-jusiya border post linking jurud al-qaa in lebanon to qusayr in homs countryside | Lebanon |
| history repeated itself. jn almost vanished when #is came to #syria and massive bayah from #aleppo & #raqqa https://t.co/qmmg1k8csa | Syria |
| @macroarch: أبو سمرا، طرابلس، لبنان abo samra, tripoli, lebanon ht | Lebanon |
| iraq hashd criminals bought to justice same way the killed sunnis , they were killed by is https://t.co/qsmogs3ihh/s/cgxy http. | Iraq |
| #is #wilayatbarqah #libya distributing da’wah leaflets in noufaliy | Libya |
| poor is “baathists & saddamists”,even russia have intervenerad on behalf of #assad so, kuffar, nationalists, apostates, sultan scholars etc | Russia |
| us-trained division 30 has entered marea to fight the islamic state, jn has reportedly allowed this (?!) http://t.co/xc7g/s/6khj. | United States |
| breaking nigeria islamic state advance on lagos the business capital biggest city in nigeria http://t.co/nfchta18f3/s/n6gp | Nigeria |
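If the fuzzy join above feels opaque, here's a minimal sketch of regex_left_join() on toy data; the tibbles, column names, and pattern are invented for illustration:

```r
library(dplyr)
library(fuzzyjoin)

texts <- tibble(text = c("war in syria", "talks in geneva"))
patterns <- tibble(country = "Syria", regex = "syria")

# Every row of `texts` is kept; `country` (and `regex`) are filled in
# wherever `regex` matches `text`, and left NA otherwise.
regex_left_join(texts, patterns, by = c("text" = "regex"))
# -> 2 rows: "war in syria" is tagged "Syria", "talks in geneva" gets NA
```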
Now that we have a column containing many of the countries mentioned in each tweet, we can visualize the most discussed countries.
Here’s how:

1. Start with tagged_countries.
2. Drop NA values with drop_na().
3. count() up how many times each country_name occurs; count() returns the column counted and a new column: n.
4. Keep the top 25 countries by n with top_n().
5. arrange() the rows by n.
6. mutate() country_name, turning it into a factor ordered by n.
7. Pipe into ggplot(), using country_name as the x= aes()thetic and n for the y= aes()thetic.
8. Add geom_col() and set the fill color to n.
9. coord_flip() the axes so the country names are readable.
10. scale_fill_viridis_c() to fill the columns.
11. theme_minimal() to add a nice theme.
12. labs() to add a title and axis labels.
tagged_countries %>% # Step 1.
drop_na(country_name) %>% # 2.
count(country_name) %>% # 3.
top_n(n = 25, wt = n) %>% # 4.
arrange(n) %>% # 5.
mutate(country_name = as_factor(country_name)) %>% # 6.
ggplot(aes(x = country_name, y = n)) + # 7.
geom_col(aes(fill = n), show.legend = FALSE) + # 8.
coord_flip() + # 9.
scale_fill_viridis_c() + # 10.
theme_minimal(base_family = "serif") + # 11.
labs(title = "Top-25 Most Mentioned Countries", # 12.
x = NULL, y = "# of Mentions")
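The arrange()-then-as_factor() combination is what puts the bars in order: forcats::as_factor() sets factor levels in order of first appearance, unlike base factor(), which sorts levels alphabetically. A small sketch with made-up values:

```r
library(forcats)

x <- c("banana", "cherry", "apple")

levels(factor(x))    # alphabetical: "apple" "banana" "cherry"
levels(as_factor(x)) # order of appearance: "banana" "cherry" "apple"
```

Since we arrange() by n first, the appearance order is the count order, and ggplot2 plots the bars in level order.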
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Arabic_Saudi Arabia.1256
## [2] LC_CTYPE=Arabic_Saudi Arabia.1256
## [3] LC_MONETARY=Arabic_Saudi Arabia.1256
## [4] LC_NUMERIC=C
## [5] LC_TIME=Arabic_Saudi Arabia.1256
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
## [5] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0
## [9] tidyverse_1.2.1 lubridate_1.7.4 fuzzyjoin_0.1.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.8 haven_2.1.1
## [4] lattice_0.20-38 colorspace_1.4-1 vctrs_0.2.0
## [7] generics_0.0.2 htmltools_0.3.6 viridisLite_0.3.0
## [10] yaml_2.2.0 utf8_1.1.4 rlang_0.4.0
## [13] pillar_1.4.2 glue_1.3.1 withr_2.1.2
## [16] modelr_0.1.4 readxl_1.3.1 munsell_0.5.0
## [19] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.4
## [22] htmlwidgets_1.3 kableExtra_1.1.0 evaluate_0.14
## [25] labeling_0.3 knitr_1.23 curl_4.0
## [28] fansi_0.4.0 highr_0.8 broom_0.5.2
## [31] Rcpp_1.0.2 scales_1.0.0 backports_1.1.4
## [34] webshot_0.5.1 jsonlite_1.6 countrycode_1.1.0
## [37] hms_0.5.0 digest_0.6.20 stringi_1.4.3
## [40] rprojroot_1.3-2 grid_3.6.1 here_0.1
## [43] cli_1.1.0.9000 tools_3.6.1 magrittr_1.5
## [46] lazyeval_0.2.2 crayon_1.3.4 pkgconfig_2.0.2
## [49] zeallot_0.1.0 ellipsis_0.2.0.9000 xml2_1.2.1
## [52] assertthat_0.2.1 rmarkdown_1.14 httr_1.4.0
## [55] rstudioapi_0.10 R6_2.4.0 nlme_3.1-140
## [58] compiler_3.6.1