Introduction

The goal of this lesson is to provide you with foundational skills for handling text data. By the end, you’ll be able to match, search, manipulate, and analyze text in more powerful ways and larger quantities than you could before.

Setting Up

Before we get started, let’s make sure our environment is ready for work and we have all the packages that we’ll use installed.

  1. Start with a clean R workspace.
    • If you’re using RStudio, you can click Session and Restart R, or just press Ctrl + Shift + F10.
  2. Check and install any missing packages you’ll need for this lesson.

Load Packages

The following function isn’t part of the lesson plan, but we’ll be using it to print() data.frames in a more human-readable way than the default console output.

Strings

In programming, we usually refer to character sequences as strings. In R, we formally refer to strings as being of type character, but we’ll typically refer to them as strings anyways.

Creating a String

To make strings, we wrap text inside of either double or single quotation marks, like so…
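Here’s a sketch of how the two strings below might be created and printed (the variable names are ours):

```r
# A string wrapped in double quotes
double_quoted <- "This is how we make a string with double quotes."
# A string wrapped in single quotes
single_quoted <- 'This is how we make a string with single quotes.'

print(double_quoted)
print(single_quoted)
```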

## [1] "This is how we make a string with double quotes."
## [1] "This is how we make a string with single quotes."

The easiest way to include quotation marks inside of a string is to simply use the other kind of quotes to wrap our text, like so…
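For example (variable names are ours):

```r
# Single quotes on the outside let us use double quotes inside, and vice versa
contains_double <- 'This is how we make a string that CONTAINS "DOUBLE QUOTES".'
contains_single <- "This is how we make a string that CONTAINS 'SINGLE QUOTES'."

print(contains_double)
print(contains_single)
```

Note that when print()ed, the inner double quotes are displayed with escaping backslashes.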

## [1] "This is how we make a string that CONTAINS \"DOUBLE QUOTES\"."
## [1] "This is how we make a string that CONTAINS 'SINGLE QUOTES'."

Whitespace

Line breaks, such as those created when you press the Enter or Return key, are represented by "\n".

## [1] "\nThis string\nhas line\nbreaks.\n"

As you can see, the line breaks are displayed as "\n". If we were to directly insert "\n" instead of pressing Enter, it would look like so…

If we want to print() a string so that the text is rendered in a human-readable format, we can use cat(), which concatenates its arguments and prints them with escape sequences interpreted.
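For example (the variable name is ours):

```r
string_with_breaks <- "\nThis string\nhas line\nbreaks.\n"

print(string_with_breaks) # shows the "\n" escape sequences
cat(string_with_breaks)   # renders the actual line breaks
```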

## 
## This string
## has line
## breaks.

Similarly, tabs are represented by "\t".
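For example:

```r
string_with_tabs <- "This string\thas tabs."
cat(string_with_tabs)
```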

## This string  has tabs.

Before we move on, let’s create a character vector named practice_tweets that we’ll use for the next exercises.
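Based on the tweets that appear in the tables later in this lesson, practice_tweets can be reconstructed like so:

```r
practice_tweets <- c(
  "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k",
  "Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0",
  "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah",
  "#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants.",
  "Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side.",
  "RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT"
)
```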

Regular Expressions

The "\" used in "\n" and "\t" is a backslash. Backslashes are used to escape characters, which is how we can trigger alternative interpretations of subsequent characters. The concept of escape characters and other special characters gives us powerful ways to manipulate text data using Regular Expressions, or RegEx. We’ll harness the power of regular expressions using the {stringr} package, which loads when we call library(tidyverse).

We use regular expressions to match character sequences. Using stringr::str_view(), we can explore exactly how text is matched.

Basic Pattern Matching

The simplest patterns to match are those that can be interpreted literally. By literal, I mean what you see is what you get.

For example, let’s say we want to find all the occurrences of "ISIS" in practice_tweets. Our pattern will simply be "ISIS".
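A sketch of that call:

```r
# {stringr} is attached when we call library(tidyverse)
str_view(practice_tweets, "ISIS")
```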



We can use literal patterns to match letters, numbers, and many other characters, such as "#" and "/".



Notice that str_view() only matches the first occurrence of our patterns. In order to match all occurrences, {stringr} uses a convention where the function name is suffixed with *_all(), like so…



If we want to match multiple potential characters, we can separate individual patterns with "|", pronounced “or”.

For example, if we wanted to match all occurrences of either "#" or "/", we can use "#|/" like so…



We can also enclose optional individual character patterns inside square-brackets for the same effect, like so…
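Both forms produce the same matches:

```r
str_view_all(practice_tweets, "#|/")   # "#" or "/" using alternation
str_view_all(practice_tweets, "[#/]")  # same matches using a character class
```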



There are many sequences that are used often enough that they have special, built-in patterns. The ones we’ll use most often are to capture digits, punctuation, and letters.

If we want to match any number, we use the pattern "[0-9]".



If we want to match any punctuation, we use the pattern "[:punct:]".


If we want to match any letter, we use the pattern "[A-z]". (Be aware that "[A-z]" technically also matches a few ASCII characters that fall between "Z" and "a", such as "_" and "["; "[A-Za-z]" is the stricter choice.)


If we want to match only lower-case letters or only upper-case letters, we use "[a-z]" or "[A-Z]", respectively, like so…
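Putting those built-in classes side by side:

```r
str_view_all(practice_tweets, "[0-9]")     # any digit
str_view_all(practice_tweets, "[:punct:]") # any punctuation
str_view_all(practice_tweets, "[A-z]")     # any letter
str_view_all(practice_tweets, "[a-z]")     # lower-case letters only
str_view_all(practice_tweets, "[A-Z]")     # upper-case letters only
```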


We’ve been using str_view_all() in order to match all occurrences of the patterns we provide, but this results in multiple, separate matches. Note that the outputs above contain individual boxes for each matched character.

However, we typically want to match patterns that contain multiple characters, or even an unknown number of characters, for which we need to use some special characters.

You may have noticed that when we try to match the pattern "ISIS", we don’t match "IS".



If we think of "ISIS" as two back-to-back occurrences of the pattern "IS", then we can solve this problem by matching "IS" wherever it occurs once or more.

We can do this by enclosing "IS" inside "()", which we refer to as a group.

We then append "+" at the end of the group, making "(IS)+", which we can then use to match "IS" or "ISIS".
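A sketch of that call:

```r
str_view_all(practice_tweets, "(IS)+")  # matches "IS" and "ISIS"
```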



However, we have another problem: we’re still not matching "ISIL".

If you recall, we were able to match "#" or "/" by using either square-brackets or "|". We can use the same strategy here to match either "S" or "L", like so…
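Combining the group and the character class (this is the same core pattern that reappears later in tagger_regex):

```r
str_view_all(practice_tweets, "(I[SL])+")  # matches "IS", "ISIS", and "ISIL"
```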



Now our pattern matches all occurrences of "IS", "ISIS" or "ISIL".

Detecting Patterns

If we want to match the boundary of a word, we use \\b and if we want to match any word character ([A-z0-9_]), we use \\w.

With that in mind, let’s create a basic_url_regex variable and a basic_hashtag_regex variable.

Since we’re usually using data.frames to manage our data in R, let’s also create a tibble data.frame with our practice_tweets in a column named text.
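The exact patterns aren’t shown in this rendering; these are plausible versions ("\\bhttp.*\\b" is consistent with the combined pattern printed later in this lesson):

```r
basic_url_regex     <- "\\bhttp.*\\b" # a URL: "http" at a word boundary, to the last word boundary
basic_hashtag_regex <- "#\\w+\\b"     # a hashtag: "#" followed by word characters

# tibble() is attached via library(tidyverse)
practice_tweets_df <- tibble(text = practice_tweets)
```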

text
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants.
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side.
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT

If we want to test whether a pattern is matched, we can get TRUE or FALSE using str_detect().

Let’s add a column to practice_tweets_df called has_url.
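A sketch of that step:

```r
practice_tweets_df <- practice_tweets_df %>%
  mutate(has_url = str_detect(text, basic_url_regex))
```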

text has_url
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k TRUE
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 TRUE
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah FALSE
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. FALSE
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. FALSE
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT TRUE

Extracting Patterns

If we want to extract a pattern from a string, we use str_extract().

Let’s add a column named url that contains the result of str_extract()
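A sketch of that step:

```r
practice_tweets_df <- practice_tweets_df %>%
  mutate(url = str_extract(text, basic_url_regex))
```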

text url
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k https://t.co/4ZsnI05o2k
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 https://t.co/ugFmV322s0
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah NA
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. NA
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. NA
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT https://t.co/dNWvurivFT

str_extract() only returns the first sequence that matches pattern. If we want to extract all matches, we use str_extract_all().

Let’s add a column named hashtags using str_extract_all().
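A sketch of that step:

```r
practice_tweets_df <- practice_tweets_df %>%
  mutate(hashtags = str_extract_all(text, basic_hashtag_regex))
```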

text hashtags
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k #ISIS
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 character(0)
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah c(“#Iraqi”, “#Fallujah”)
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. #Iraqi
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. character(0)
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT c(“#ISIS”, “#Raqqa”, “#Syria”, “#ISIL”)

Since str_extract_all() is designed to return multiple matches, it returns a list by default. Note that list columns are different from most columns, which are usually atomic vectors, and they require different approaches to handle.

## # A tibble: 6 x 2
##   text                                                             hashtags
##   <chr>                                                            <list>  
## 1 New #ISIS media release to be on air soon. https://t.co/4ZsnI05~ <chr [1~
## 2 Denmark to expand anti-ISIS military mission with 400 Elite Sol~ <chr [0~
## 3 Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah         <chr [2~
## 4 #Iraqi Army claims: Repelled IS assault in Northern Baiji and k~ <chr [1~
## 5 Nusra used to have a 5 to 7 km frontline against IS in North-Al~ <chr [0~
## 6 RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raq~ <chr [4~

Removing Substrings

If we want to remove matched patterns, we use str_remove() and str_remove_all().
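Sketches of the two removals shown below:

```r
# Remove the first (and here, only) URL in each tweet
practice_tweets_df %>%
  mutate(no_url = str_remove_all(text, basic_url_regex))

# Remove every punctuation character
practice_tweets_df %>%
  mutate(no_punct = str_remove_all(text, "[:punct:]"))
```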

text no_url
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k New #ISIS media release to be on air soon.
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 Denmark to expand anti-ISIS military mission with 400 Elite Soldiers….
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. #Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants.
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side.
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL
text no_punct
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k New ISIS media release to be on air soon httpstco4ZsnI05o2k
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 Denmark to expand antiISIS military mission with 400 Elite Soldiers httpstcougFmV322s0
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah Nearly 40 Iraqi Soldiers killed by ISIS in NE Fallujah
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. Iraqi Army claims Repelled IS assault in Northern Baiji and killed 62 Militants
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. Nusra used to have a 5 to 7 km frontline against IS in NorthAleppo IS could never advance from that side
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT RT Raqqasl1 ISIS cover books for the 3 Third grade 3 Raqqa Syria ISIL httpstcodNWvurivFT

Combining Strings

If we want to combine separate strings into a single character vector, we use str_c() (or paste0()).

We can provide an argument to sep= indicating a character we’d like to insert between each of the strings we provide.
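For example, combining our URL pattern with the punctuation class (the variable name removal_regex is ours):

```r
removal_regex <- str_c(basic_url_regex, "[:punct:]", sep = "|")
removal_regex

practice_tweets_df %>%
  mutate(no_urls_or_punct = str_remove_all(text, removal_regex))
```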

## [1] "\\bhttp.*\\b|[:punct:]"
text no_urls_or_punct
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k New ISIS media release to be on air soon
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 Denmark to expand antiISIS military mission with 400 Elite Soldiers
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah Nearly 40 Iraqi Soldiers killed by ISIS in NE Fallujah
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. Iraqi Army claims Repelled IS assault in Northern Baiji and killed 62 Militants
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. Nusra used to have a 5 to 7 km frontline against IS in NorthAleppo IS could never advance from that side
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT RT Raqqasl1 ISIS cover books for the 3 Third grade 3 Raqqa Syria ISIL

Removing Excess Whitespace

Whitespace inside of strings is something we’ll need to handle often. To demonstrate, we’ll add some more columns to practice_tweets_df and assign the result to a variable named excess_whitespace_df.

  • Steps:
    1. take practice_tweets_df
    2. rename() the text column to original_text
    3. mutate() to add a column with excess_whitespace by str_replace_all() inserting " " at the start, end, and anywhere there’s already a space in original_text.
    4. filter() rows, only keeping those where excess_whitespace has fewer than 100 characters.
    5. convert excess_whitespace_df to a list with as.list(), print()ing it in a way that our browsers won’t remove the excess whitespace
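The steps above can be sketched as (the exact replacement pattern is an assumption consistent with the output below):

```r
excess_whitespace_df <- practice_tweets_df %>%
  rename(original_text = text) %>%
  mutate(
    # replace the start, the end, and every existing space with extra spaces
    excess_whitespace = str_replace_all(original_text, "^| |$", "   ")
  ) %>%
  filter(nchar(excess_whitespace) < 100)

excess_whitespace_df %>%
  select(original_text, excess_whitespace) %>%
  as.list()
```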
## $original_text
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"          
## 
## $excess_whitespace
## [1] "   New   #ISIS   media   release   to   be   on   air   soon.   https://t.co/4ZsnI05o2k   "
## [2] "   Nearly   40   #Iraqi   Soldiers   killed   by   ISIS   in   NE   #Fallujah   "

Now let’s add more columns so we can see the effects of various techniques to remove excess_whitespace.

  • Steps:
    1. take excess_whitespace_df
    2. mutate() to add columns that
      1. str_squish() the text to remove excess whitespace inside the string
      2. str_trim() the text to remove whitespace at the start and end of the string
      3. str_trim() and str_squish() the string
      4. str_remove_all() with pattern=\\s+, removing all whitespace anywhere in the string
    3. convert the result to a list with as.list(), print()ing it in a way that our browsers won’t remove the excess whitespace.
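A sketch of those steps (column names follow the printed list below):

```r
excess_whitespace_df %>%
  mutate(
    squished            = str_squish(excess_whitespace),
    trimmed             = str_trim(excess_whitespace),
    squished_and_trimed = excess_whitespace %>% str_trim() %>% str_squish(),
    no_whitespace       = str_remove_all(excess_whitespace, "\\s+")
  ) %>%
  as.list()
```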
## $original_text
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"          
## 
## $excess_whitespace
## [1] "   New   #ISIS   media   release   to   be   on   air   soon.   https://t.co/4ZsnI05o2k   "
## [2] "   Nearly   40   #Iraqi   Soldiers   killed   by   ISIS   in   NE   #Fallujah   "          
## 
## $squished
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"          
## 
## $trimmed
## [1] "New   #ISIS   media   release   to   be   on   air   soon.   https://t.co/4ZsnI05o2k"
## [2] "Nearly   40   #Iraqi   Soldiers   killed   by   ISIS   in   NE   #Fallujah"          
## 
## $squished_and_trimed
## [1] "New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k"
## [2] "Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah"          
## 
## $no_whitespace
## [1] "New#ISISmediareleasetobeonairsoon.https://t.co/4ZsnI05o2k"
## [2] "Nearly40#IraqiSoldierskilledbyISISinNE#Fallujah"

Replacing Substrings

Since you just saw str_replace_all(), let’s see some more examples of how we can replace character sequences in bulk.

text urls_replaced
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k New #ISIS media release to be on air soon. {{THIS WAS A URL}}
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. {{THIS WAS A URL}}
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. #Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants.
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side.
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL {{THIS WAS A URL}}

When dealing with many complicated RegEx patterns, it’s often easier to set up a single variable that lets str_replace_all() handle several patterns simultaneously.

We can do this using a named vector where the names are the patterns we want to match and the values are their replacements.

Here’s one called tagger_regex.
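Based on the printed contents below, tagger_regex can be reconstructed like so:

```r
tagger_regex <- c(
  "(I[SL])+|[Ii]slamic.?[Ss]tate" = "{{ISIS}}",
  "#(\\w|_)+\\b"                  = "{{HASHTAG}}",
  "@(\\w|_)+\\b"                  = "{{SCREEN_NAME}}",
  "\\bhttp.*?(\\s|$)"             = "{{URL}}",
  "@|#"                           = ""
)

tagger_regex
```

Applying it is then a single call: `practice_tweets_df %>% mutate(standardized = str_replace_all(text, tagger_regex))`.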

## (I[SL])+|[Ii]slamic.?[Ss]tate                  #(\\w|_)+\\b 
##                    "{{ISIS}}"                 "{{HASHTAG}}" 
##                  @(\\w|_)+\\b             \\bhttp.*?(\\s|$) 
##             "{{SCREEN_NAME}}"                     "{{URL}}" 
##                           @|# 
##                            ""
## # A tibble: 5 x 2
##   pattern                       replacement    
##   <chr>                         <chr>          
## 1 (I[SL])+|[Ii]slamic.?[Ss]tate {{ISIS}}       
## 2 "#(\\w|_)+\\b"                {{HASHTAG}}    
## 3 "@(\\w|_)+\\b"                {{SCREEN_NAME}}
## 4 "\\bhttp.*?(\\s|$)"           {{URL}}        
## 5 @|#                           ""

Using tagger_regex, let’s standardize references to ISIS, hashtags, screen names, and URLs, while also dropping @ and #.

text standardized
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k New {{ISIS}} media release to be on air soon. {{URL}}
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 Denmark to expand anti-{{ISIS}} military mission with 400 Elite Soldiers…. {{URL}}
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah Nearly 40 {{HASHTAG}} Soldiers killed by {{ISIS}} in NE {{HASHTAG}}
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. {{HASHTAG}} Army claims: Repelled {{ISIS}} assault in Northern Baiji and killed 62 Militants.
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. Nusra used to have a 5 to 7 km frontline against {{ISIS}} in North-Aleppo. {{ISIS}} could never advance from that side.
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT RT {{SCREEN_NAME}}: {{ISIS}} cover books for the 3 Third grade (3) {{HASHTAG}} {{HASHTAG}} {{ISIS}} {{URL}}

Counting Substrings

Let’s now see how str_count() works.

First, we’ll turn tagger_regex into a list where each RegEx is its own element, then we’ll set_names() so we can access each RegEx conveniently with $.
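A sketch of that step (the element names follow the printed list below):

```r
counter_regex <- tagger_regex %>%
  names() %>%
  as.list() %>%
  set_names(c("isis", "hashtag", "screen_name", "url", "mention_hashtag_start"))

counter_regex
```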

## $isis
## [1] "(I[SL])+|[Ii]slamic.?[Ss]tate"
## 
## $hashtag
## [1] "#(\\w|_)+\\b"
## 
## $screen_name
## [1] "@(\\w|_)+\\b"
## 
## $url
## [1] "\\bhttp.*?(\\s|$)"
## 
## $mention_hashtag_start
## [1] "@|#"

Now we’ll use counter_regex to count up the number of matches for each pattern and place the results in new columns.
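A sketch of that step (column names follow the table below):

```r
practice_tweets_df %>%
  mutate(
    n_screen_names  = str_count(text, counter_regex$screen_name),
    n_hashtags      = str_count(text, counter_regex$hashtag),
    n_urls          = str_count(text, counter_regex$url),
    n_isis_mentions = str_count(text, counter_regex$isis)
  )
```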

text n_screen_names n_hashtags n_urls n_isis_mentions
New #ISIS media release to be on air soon. https://t.co/4ZsnI05o2k 0 1 1 1
Denmark to expand anti-ISIS military mission with 400 Elite Soldiers…. https://t.co/ugFmV322s0 0 0 1 1
Nearly 40 #Iraqi Soldiers killed by ISIS in NE #Fallujah 0 2 0 1
#Iraqi Army claims: Repelled IS assault in Northern Baiji and killed 62 Militants. 0 1 0 1
Nusra used to have a 5 to 7 km frontline against IS in North-Aleppo. IS could never advance from that side. 0 0 0 2
RT @Raqqa_sl1: #ISIS cover books for the 3 Third grade (3) #Raqqa #Syria #ISIL https://t.co/dNWvurivFT 1 4 1 2

Case Study: ISIS Fanboys

Fifth Tribe is a DC-based digital agency that scraped 17,410 tweets from pro-ISIS accounts following the November 2015 Paris Attacks. The scraped data begin in January 2015 and end in May 2016.

Fifth Tribe submitted the data to Kaggle, an online data science community that allows users to find and publish data sets and enter competitions to solve data science challenges.

Data Access

The data are available on Kaggle, as well as in the GitHub repository ababen/How-Isis-Uses-Twitter.

You can obtain the data either way, but here’s a convenience function that reads the CSV file directly from the above GitHub repository.
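A sketch of such a function; note that the file path and branch inside the repository are assumptions here, and the function name is ours:

```r
read_isis_tweets_csv <- function() {
  # NOTE: "master/tweets.csv" is an assumed path within the repository
  readr::read_csv(
    "https://raw.githubusercontent.com/ababen/How-Isis-Uses-Twitter/master/tweets.csv"
  )
}

init_isis_tweets <- read_isis_tweets_csv()
glimpse(init_isis_tweets)
```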

## Observations: 17,410
## Variables: 8
## $ name           <chr> "GunsandCoffee", "GunsandCoffee", "GunsandCoffe...
## $ username       <chr> "GunsandCoffee70", "GunsandCoffee70", "GunsandC...
## $ description    <chr> "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews",...
## $ location       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ followers      <dbl> 640, 640, 640, 640, 640, 640, 640, 640, 640, 64...
## $ numberstatuses <dbl> 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49,...
## $ time           <chr> "1/6/2015 21:07", "1/6/2015 21:27", "1/6/2015 2...
## $ tweets         <chr> "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU...

Initial Clean Up

Fortunately, the data don’t require extensive cleaning, but there are two things we want to address immediately.

  1. Instead of using the data set’s column names, we’ll use the variable names that Twitter itself uses.
  2. The time column is of type character, so we need to convert it to a proper date-time format to actually use it in our analysis.
  • Steps:
    1. take init_isis_tweets, then…
    2. select() the following columns:
      1. time, but rename it to created_at
      2. username, but rename it to screen_name
      3. numberstatuses, but rename it to status_count
      4. tweets, but rename it to text
      5. followers, but rename it to followers_count
      6. everything() (any remaining columns)
    3. mutate() created_at so that it’s a standard date-time data type
      • we’ll use as.POSIXct() to convert created_at to type POSIXct
        • POSIXct is a data type that represents time using the UNIX Epoch time
          • UNIX Epoch time is a system that tracks time as the number of seconds since 1 January 1970
      • to do so, we also need to provide an argument to format= which tells R where to find the:
        • month: %m (1 or 2 digit month )
        • day: %d (1 or 2 digit day)
        • year: %Y (4 digit year)
        • hour: %H (1 or 2 digit hour)
        • minute: %M (1 or 2 digit minute)
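Putting those steps together (the variable name isis_tweets is our assumption):

```r
isis_tweets <- init_isis_tweets %>%
  select(
    created_at      = time,
    screen_name     = username,
    status_count    = numberstatuses,
    text            = tweets,
    followers_count = followers,
    everything()
  ) %>%
  mutate(
    # "1/6/2015 21:07" -> 2015-01-06 21:07:00
    created_at = as.POSIXct(created_at, format = "%m/%d/%Y %H:%M")
  )
```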
## Observations: 17,410
## Variables: 8
## $ created_at      <dttm> 2015-01-06 21:07:00, 2015-01-06 21:27:00, 201...
## $ screen_name     <chr> "GunsandCoffee70", "GunsandCoffee70", "Gunsand...
## $ status_count    <dbl> 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49...
## $ text            <chr> "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHF...
## $ followers_count <dbl> 640, 640, 640, 640, 640, 640, 640, 640, 640, 6...
## $ name            <chr> "GunsandCoffee", "GunsandCoffee", "GunsandCoff...
## $ description     <chr> "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews"...
## $ location        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

For simplicity’s sake, we’ll split our data into two variables: isis_statuses and isis_users.

isis_statuses only keeps distinct() rows from the created_at, screen_name, and text columns.
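A sketch, assuming the cleaned data live in a variable like isis_tweets:

```r
isis_statuses <- isis_tweets %>%
  distinct(created_at, screen_name, text)
```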

## Observations: 17,410
## Variables: 3
## $ created_at  <dttm> 2015-01-06 21:07:00, 2015-01-06 21:27:00, 2015-01...
## $ screen_name <chr> "GunsandCoffee70", "GunsandCoffee70", "GunsandCoff...
## $ text        <chr> "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFUL I...

isis_users only keeps distinct() rows from the screen_name, name, description, status_count, and followers_count columns.
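Similarly:

```r
isis_users <- isis_tweets %>%
  distinct(screen_name, name, description, status_count, followers_count)
```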

## Observations: 325
## Variables: 5
## $ screen_name     <chr> "GunsandCoffee70", "AbuLaythAlHindi", "YazeedD...
## $ name            <chr> "GunsandCoffee", "Abu Layth Al Hindi", "ابو ال...
## $ description     <chr> "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews"...
## $ status_count    <dbl> 49, 18, 127, 273, 471, 274, 273, 127, 797, 798...
## $ followers_count <dbl> 640, 68, 904, 112, 25, 119, 119, 823, 324, 328...

Extract Tweet Entities

Since the data don’t provide us many of the entities that Twitter actually provides, we need to manually extract the hashtags used, screen names mentioned (mentions_screen_name), and URLs shared (urls_url).

With that in mind, let’s write a set of extract_all_*() functions.

Helper Functions

In order to best enforce correctness, the RegEx patterns we’ll use here are more complicated than those we used earlier. Unfortunately, working with loosely formatted text in multiple languages gets extremely complicated, and the nuances of character encoding are well beyond the scope of this lesson. However, since we want our analysis to be correct, the patterns we’ll use for the rest of the lesson are more robust to real-world data.

First, we’ll write a function to extract_all_hashtags()…

Next, we’ll write a function to extract_all_mentions() using similar code…

Then we’ll write a function to extract_all_urls()

Here’s a function to test whether or not a tweet is a retweet…

Last, we’ll write a function to tell us whether or not a tweet contains_arabic_script()
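Sketches of those five helpers follow; the lesson’s production patterns are more robust than these simplified versions (only the Arabic-script character class is taken directly from the text):

```r
extract_all_hashtags <- function(string) {
  str_extract_all(string, "#(\\w|_)+\\b")
}

extract_all_mentions <- function(string) {
  str_extract_all(string, "@(\\w|_)+\\b")
}

extract_all_urls <- function(string) {
  str_extract_all(string, "\\bhttp[^\\s]+")
}

is_retweet <- function(string) {
  # Retweets conventionally begin with "RT @screen_name:"
  str_detect(string, "^RT @|\\bRT @")
}

contains_arabic_script <- function(string) {
  str_detect(string, "[\u0621-\u064A\u0660-\u0669]")
}
```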

The pattern= argument ("[\u0621-\u064A\u0660-\u0669]") is a character class of Unicode sequences corresponding to Arabic characters. R, like many other older programming languages, doesn’t handle non-Latin letters very well, and the right-to-left cursive style of Arabic doesn’t always play nice with R, especially on Windows.

That said, unicode sequences work just fine under the hood…
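For example:

```r
cat("\u0621", "\u0622", "\u0623", "\u064A", "\u063A")
```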

## ء آ أ ي غ

Using our helper functions, let’s enrich our data.

  • Steps:
    1. take isis_statuses
    2. mutate() to add the columns…
      1. is_translated
      2. prepped_text: remove translation tags, trim whitespace, cast to lowercase
    3. mutate() to add columns using…
      1. extract_all_hashtags()
      2. extract_all_mentions()
      3. extract_all_urls()
    4. mutate() to annotate whether each tweet..
      1. has_hashtags
      2. has_urls
      3. has_mentions
      • since hashtags, mentions_screen_names, and urls_url can have multiple values, they are list columns
        • use map() to iterate over each row, checking whether or not it’s empty
    5. mutate() to add the columns…
      1. is_retweet
      2. contains_arabic_script
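The steps above can be sketched as follows; the translation-tag pattern and the variable name augmented_statuses are assumptions:

```r
augmented_statuses <- isis_statuses %>%
  mutate(
    is_translated = str_detect(text, "ENGLISH TRANSLATION"),
    prepped_text  = text %>%
      str_remove_all("ENGLISH TRANSLATION:?") %>% # assumed tag pattern
      str_trim() %>%
      str_to_lower()
  ) %>%
  mutate(
    hashtags             = extract_all_hashtags(prepped_text),
    mentions_screen_name = extract_all_mentions(prepped_text),
    urls_url             = extract_all_urls(prepped_text)
  ) %>%
  mutate(
    # list columns: iterate with map_lgl() to test each row for emptiness
    has_hashtags = map_lgl(hashtags, ~ length(.x) > 0),
    has_urls     = map_lgl(urls_url, ~ length(.x) > 0),
    has_mentions = map_lgl(mentions_screen_name, ~ length(.x) > 0)
  ) %>%
  mutate(
    is_retweet             = is_retweet(text),
    contains_arabic_script = contains_arabic_script(text)
  )
```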

Now we can use the columns of type logical (has_hashtags, has_urls, has_mentions, is_retweet, contains_arabic_script) to easily filter our results.

## Observations: 64
## Variables: 7
## $ created_at           <dttm> 2015-09-09 21:59:00, 2015-09-15 15:20:00...
## $ screen_name          <chr> "abubakerdimshqi", "abubakerdimshqi", "ab...
## $ text                 <chr> "RT @__IslamReligion: Why Do Muslims Eat ...
## $ prepped_text         <chr> "rt @__islamreligion: why do muslims eat ...
## $ hashtags             <list> [<"usa", "uk", "sasummer", "rt">, <"جيش_...
## $ mentions_screen_name <list> ["__islamreligion", "freealeppo1985", "o...
## $ urls_url             <list> ["http://t.co/ceetypwwmq", "http://t.co/...

Using augmented_statuses, we can get started on our actual analysis.

Hashtags

If you’re not familiar, hashtags are a way for users to tag their content to make it easier for others to find it.

Using hashtags, we can develop a broad sense of the topics discussed in the data. While you’ve likely seen word clouds, we’re going to stick to visualizations that actually mean something.

With that in mind, let’s create a bar chart showing us the top-40 hashtags used in the data.

  • Steps:
    1. take augmented_statuses
    2. using our new has_hashtags column, filter() rows to only keep statuses containing hashtags
    3. select() only the hashtags column
    4. unnest() the hashtags column
      • since a status can have multiple hashtags and str_extract_all() returns a list by default, the column is currently a list column
    5. count() up the hashtags, which creates a column named n
    6. using top_n(), only keep the top-40 hashtags by providing 40 and n to the n= and wt= arguments, respectively
    7. sort rows based on the n column
    8. mutate() hashtags to convert them to a data type that has an order (factor)
      • we use forcats::as_factor() as its default behavior is to set the order (levels) to the order in which the data exist
      • the purpose of this is so that our final plot’s y-axis is sorted based on n
    9. start the plot using ggplot(), using hashtags as the x= aes()thetic and n for the y= aes()thetic
    10. use geom_col() and set the fill color to n
    11. flip the axes using coord_flip()
    12. add a non-default color scale to fill the columns
    13. use theme_minimal() to add a nice theme
    14. customize the labels with labs()
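The steps above can be sketched as follows; the color scale (scale_fill_viridis_c()) and label text are our assumptions:

```r
augmented_statuses %>%
  filter(has_hashtags) %>%
  select(hashtags) %>%
  unnest(hashtags) %>%
  count(hashtags) %>%
  top_n(n = 40, wt = n) %>%
  arrange(n) %>%
  # as_factor() sets the levels to the current row order, so the flipped
  # y-axis ends up sorted by n
  mutate(hashtags = as_factor(hashtags)) %>%
  ggplot(aes(x = hashtags, y = n)) +
  geom_col(aes(fill = n)) +
  coord_flip() +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(x = NULL, y = "Times Used", title = "Top 40 Hashtags")
```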

Unsurprisingly, the top-5 hashtags include #isis, #is, and #islamicstate.

  • We can also see several other hashtag categories:
    • the names of Syrian governorates and cities
      • Palmyra, Aleppo, Homs, Damascus, Deir ez-Zor
    • Iraqi governorates and cities:
      • al-Anbar, Ramadi, Fallujah, Baghdad
    • organizations participating in the conflicts in Iraq and Syria
      • YPG, PKK, SAA
    • countries participating in conflicts in Iraq and Syria and beyond
      • Russia, USA, Turkey, Libya, Egypt

Now that we’ve had a lot of practice using regular expressions, we can use them to do some really clever things.

Suppose we want to explore how countries are referenced, but we don’t have a list of all the ways people refer to a country; countries are rarely discussed using one of their many formal names, especially in tweets.

Fortunately, the {countrycode} package exists, which contains a data.frame called codelist that contains an enormous amount of data on country names. Included in codelist is a column containing regular expressions for nearly all of the world’s countries.

We’re going to use the column in a somewhat unorthodox way, but let’s do some preparation and make a variable named country_regex_df before we get to that.

Since we’re doing something unconventional, there are bound to be mistakes, so keep in mind that while regular expressions offer a lot of power, they are exceptionally fragile.

With that in mind, we’re not going to try and match every country in countrycode::codelist.

  • Steps:
    1. take countrycode::codelist
    2. add the tibble class using as_tibble()
    3. only select() the column we’ll use…
      1. country.name.en, but rename to country_name
      2. country.name.en.regex, but rename to regex
    4. filter() rows, only keeping those where country_name is not in countries_to_skip
    5. mutate() the regex column and add a Look-Behind and Look-Ahead so that each regex will only match patterns that are preceded and followed by word boundaries
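The steps above can be sketched as follows; the contents of countries_to_skip aren’t shown in this rendering, so a hypothetical placeholder stands in:

```r
countries_to_skip <- c("Georgia") # hypothetical example; the lesson's actual list isn't shown

country_regex_df <- countrycode::codelist %>%
  as_tibble() %>%
  select(
    country_name = country.name.en,
    regex        = country.name.en.regex
  ) %>%
  filter(!country_name %in% countries_to_skip) %>%
  # wrap each regex in a look-behind and look-ahead for word boundaries
  mutate(regex = str_c("(?<=\\b)(", regex, ")(?=\\b)"))
```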

Countries

## # A tibble: 279 x 2
##    country_name      regex                                  
##    <chr>             <chr>                                  
##  1 Afghanistan       "(?<=\\b)(afghan)(?=\\b)"              
##  2 Åland Islands     "(?<=\\b)(åland)(?=\\b)"
##  3 Albania           "(?<=\\b)(albania)(?=\\b)"             
##  4 Algeria           "(?<=\\b)(algeria)(?=\\b)"             
##  5 American Samoa    "(?<=\\b)(^(?=.*americ).*samoa)(?=\\b)"
##  6 Andorra           "(?<=\\b)(andorra)(?=\\b)"             
##  7 Angola            "(?<=\\b)(angola)(?=\\b)"              
##  8 Anguilla          "(?<=\\b)(anguill?a)(?=\\b)"           
##  9 Antarctica        "(?<=\\b)(antarctica)(?=\\b)"          
## 10 Antigua & Barbuda "(?<=\\b)(antigua)(?=\\b)"             
## # ... with 269 more rows

Next, we’re going to perform a fuzzy left-join, but what’s a left-join anyways?

Joining Tables

A left-join is used to combine tables (data.frames in R) so that all rows on the left-hand side are kept, while matching values from the right-hand side are added via mutual key column(s).

Let’s do a quick example. Here are two data.frames.
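If you want to follow along, here is one way to build the two tables; the exact values are assumptions inferred from the join outputs shown below.

```r
library(tibble)

# "lhs" has three keys; "rhs" only has matches for two of them.
lhs <- tibble(key = c("a", "b", "c"), val_lhs = c(1, 2, 3))
rhs <- tibble(key = c("a", "b"),      val_rhs = c(3, 4))
```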

If we want to add the rows of the rhs (right-hand side) to lhs (left-hand side) wherever the keys match, we perform a left_join().

## # A tibble: 3 x 3
##   key   val_lhs val_rhs
##   <chr>   <dbl>   <dbl>
## 1 a           1       3
## 2 b           2       4
## 3 c           3      NA

By default, left_join() will match all the columns in both data.frames that have the same name, but in practice we should always eliminate ambiguity (and the errors it invites) by specifying which columns to match, providing an argument to left_join()’s by= parameter.
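A minimal, self-contained sketch of the explicit form; the toy values are assumptions chosen to match the printed result:

```r
library(dplyr)
library(tibble)

lhs <- tibble(key = c("a", "b", "c"), val_lhs = c(1, 2, 3))
rhs <- tibble(key = c("a", "b"),      val_rhs = c(3, 4))

# Explicitly naming the join column removes any ambiguity.
joined <- left_join(lhs, rhs, by = "key")
joined
# "c" has no match on the right-hand side, so its val_rhs is NA.
```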

## # A tibble: 3 x 3
##   key   val_lhs val_rhs
##   <chr>   <dbl>   <dbl>
## 1 a           1       3
## 2 b           2       4
## 3 c           3      NA

If the column names don’t match, we again provide an argument to by=, but we clarify which columns to match using a named vector.
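A sketch of the named-vector form, again using assumed toy values that mirror the printed result:

```r
library(dplyr)
library(tibble)

# Same toy data, but now the key columns have different names.
lhs <- tibble(key_lhs = c("a", "b", "c"), val_lhs = c(1, 2, 3))
rhs <- tibble(key_rhs = c("a", "b"),      val_rhs = c(3, 4))

# The named vector reads as: match lhs$key_lhs to rhs$key_rhs.
joined <- left_join(lhs, rhs, by = c("key_lhs" = "key_rhs"))
joined
# Note that the left-hand side's column name (key_lhs) is kept.
```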

## # A tibble: 3 x 3
##   key_lhs val_lhs val_rhs
##   <chr>     <dbl>   <dbl>
## 1 a             1       3
## 2 b             2       4
## 3 c             3      NA

If you hadn’t guessed, left_join() is just one kind of table join. We won’t discuss them in this lesson, but you can also use right_join(), inner_join(), full_join(), anti_join(), and more.

Finding Country Mentions

Table joins are typically done to match exact values, but we can use the {fuzzyjoin} package to match inexact values, including regular expressions. We’ll do just that to create a new variable named tagged_countries that will add the country_name column from country_regex_df to augmented_statuses. Essentially, we’re just adding a column noting which country is mentioned, if any, to augmented_statuses.

  • Steps:
    1. take augmented_statuses
    2. perform a regex_left_join()
      1. the right-hand side is country_regex_df
      2. provide by= argument using a named vector
        • the name ("lower_text") is the column on the left-hand side (in augmented_statuses) with which we want to match the value ("regex"), which is the column on the right-hand side (in country_regex_df)
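The steps above can be sketched as follows. Since the lesson’s real augmented_statuses and country_regex_df are built earlier, the block below uses tiny hypothetical stand-ins so it can run on its own:

```r
library(tibble)
library(fuzzyjoin)

# Hypothetical stand-ins for the lesson's real objects.
augmented_statuses <- tibble(
  lower_text = c("fighting reported in syria today",
                 "no country mentioned here")
)
country_regex_df <- tibble(
  country_name = c("Syria", "France"),
  regex = c("(?<=\\b)(syria)(?=\\b)", "(?<=\\b)(france)(?=\\b)")
)

# Fuzzy left-join: match lower_text (lhs) against regex (rhs).
tagged_countries <- regex_left_join(
  augmented_statuses, country_regex_df,
  by = c("lower_text" = "regex")
)
```

Every row of augmented_statuses is kept; rows whose text matches no regex simply get NA in country_name.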
| prepped_text | country_name |
|---|---|
| ’a message to the truthful in syria - sheikh abu muhammed al maqdisi: http://t.co/73xfszsjvr http://t.co/x8bzcscxzq | Syria |
| new link, after previous one taken down:aqap-‘the faces have been brightened’ -regarding the blessed attack in france http://t.co/ralsnpd547 | France |
| #breaking #confirmed islamic state takes control of al-jusiya border post linking jurud al-qaa in lebanon to qusayr in homs countryside | Lebanon |
| history repeated itself. jn almost vanished when #is came to #syria and massive bayah from #aleppo &amp; #raqqa https://t.co/qmmg1k8csa | Syria |
| @macroarch: أبو سمرا، طرابلس، لبنان abo samra, tripoli, lebanon ht | Lebanon |
| iraq hashd criminals bought to justice same way the killed sunnis , they were killed by is https://t.co/qsmogs3ihh/s/cgxy http. | Iraq |
| #is #wilayatbarqah #libya distributing da’wah leaflets in noufaliy | Libya |
| poor is “baathists &amp; saddamists”,even russia have intervenerad on behalf of #assad so, kuffar, nationalists, apostates, sultan scholars etc | Russia |
| us-trained division 30 has entered marea to fight the islamic state, jn has reportedly allowed this (?!) http://t.co/xc7g/s/6khj. | United States |
| breaking nigeria islamic state advance on lagos the business capital biggest city in nigeria http://t.co/nfchta18f3/s/n6gp | Nigeria |

Now that we have a column containing many of the countries mentioned in each tweet, we can visualize the most discussed countries.

  • Steps:
    1. take tagged_countries
    2. drop NA values with drop_na()
    3. count() up how many times each country_name occurs
      • count() returns the column counted and a new column: n
    4. only keep the top 25 rows
    5. sort rows by n
    6. mutate() country_name, turning it into a factor ordered by n
    7. start the plot using ggplot(), using country_name as the x= aes()thetic and n as the y= aes()thetic
    8. use geom_col() and set the fill color by n
    9. flip the axes using coord_flip()
    10. add a non-default color scale to fill the columns
    11. use theme_minimal() to add a nice theme
    12. customize the labels with labs()
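A sketch of those plotting steps. Since the real tagged_countries is built from the lesson’s tweet data, a small hypothetical stand-in is used below, and the axis labels and title are assumptions:

```r
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(tibble)

# Hypothetical stand-in for the lesson's real tagged_countries.
tagged_countries <- tibble(
  country_name = c("Syria", "Syria", "Iraq", NA, "France", "Syria")
)

p <- tagged_countries %>%
  drop_na(country_name) %>%                            # 2. drop NAs
  count(country_name) %>%                              # 3. tally mentions
  top_n(25, n) %>%                                     # 4. keep top 25
  arrange(n) %>%                                       # 5. sort by n
  mutate(country_name = fct_reorder(country_name, n)) %>%  # 6. ordered factor
  ggplot(aes(x = country_name, y = n)) +               # 7. start the plot
  geom_col(aes(fill = n)) +                            # 8. columns filled by n
  coord_flip() +                                       # 9. flip the axes
  scale_fill_viridis_c() +                             # 10. non-default scale
  theme_minimal() +                                    # 11. nice theme
  labs(x = NULL, y = "Mentions",                       # 12. labels
       title = "Most Discussed Countries")

p
```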

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Arabic_Saudi Arabia.1256 
## [2] LC_CTYPE=Arabic_Saudi Arabia.1256   
## [3] LC_MONETARY=Arabic_Saudi Arabia.1256
## [4] LC_NUMERIC=C                        
## [5] LC_TIME=Arabic_Saudi Arabia.1256    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
##  [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0  
##  [9] tidyverse_1.2.1 lubridate_1.7.4 fuzzyjoin_0.1.4
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5    xfun_0.8            haven_2.1.1        
##  [4] lattice_0.20-38     colorspace_1.4-1    vctrs_0.2.0        
##  [7] generics_0.0.2      htmltools_0.3.6     viridisLite_0.3.0  
## [10] yaml_2.2.0          utf8_1.1.4          rlang_0.4.0        
## [13] pillar_1.4.2        glue_1.3.1          withr_2.1.2        
## [16] modelr_0.1.4        readxl_1.3.1        munsell_0.5.0      
## [19] gtable_0.3.0        cellranger_1.1.0    rvest_0.3.4        
## [22] htmlwidgets_1.3     kableExtra_1.1.0    evaluate_0.14      
## [25] labeling_0.3        knitr_1.23          curl_4.0           
## [28] fansi_0.4.0         highr_0.8           broom_0.5.2        
## [31] Rcpp_1.0.2          scales_1.0.0        backports_1.1.4    
## [34] webshot_0.5.1       jsonlite_1.6        countrycode_1.1.0  
## [37] hms_0.5.0           digest_0.6.20       stringi_1.4.3      
## [40] rprojroot_1.3-2     grid_3.6.1          here_0.1           
## [43] cli_1.1.0.9000      tools_3.6.1         magrittr_1.5       
## [46] lazyeval_0.2.2      crayon_1.3.4        pkgconfig_2.0.2    
## [49] zeallot_0.1.0       ellipsis_0.2.0.9000 xml2_1.2.1         
## [52] assertthat_0.2.1    rmarkdown_1.14      httr_1.4.0         
## [55] rstudioapi_0.10     R6_2.4.0            nlme_3.1-140       
## [58] compiler_3.6.1