class: center, middle, inverse, title-slide # The Joy of Data Wrangling in R ##
Crossing the Tidyverse ### Christopher Callaghan - CORE Lab ### 2019-08-07 --- # Motivation <br> .center[ ![](img/ds_cycle.png) ] <small> Source: Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc. </small> --- # Overview 1. Introduction to the **tidyverse** 🌌 2. Ingesting data with **readr** 📖 3. Data wrangling in **dplyr** 🧰 4. Working with text with **stringr** 🔤 5. Tidying data with **tidyr** 🧹 --- # tidyverse 101 A collection of R packages designed to for data science. - Shared grammar and design - Wide range of applications Install: ```r install.packages("tidyverse") ``` Launch: ```r library(tidyverse) ``` --- # Loading Data with **readr** 📖 ```r url <- "https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv" df <- read_csv(file = url) ``` ``` ## # A tibble: 243,891 x 21 ## external_author… author content region language publish_date ## <dbl> <chr> <chr> <chr> <chr> <chr> ## 1 9.06e17 10_GOP "\"We … Unkno… English 10/1/2017 1… ## 2 9.06e17 10_GOP Marsha… Unkno… English 10/1/2017 2… ## 3 9.06e17 10_GOP Daught… Unkno… English 10/1/2017 2… ## 4 9.06e17 10_GOP JUST I… Unkno… English 10/1/2017 2… ## 5 9.06e17 10_GOP 19,000… Unkno… English 10/1/2017 2… ## 6 9.06e17 10_GOP "Dan B… Unkno… English 10/1/2017 2… ## 7 9.06e17 10_GOP 🐝🐝🐝 ht… Unkno… English 10/1/2017 2… ## 8 9.06e17 10_GOP '@Sena… Unkno… English 10/1/2017 2… ## 9 9.06e17 10_GOP As muc… Unkno… English 10/1/2017 3… ## 10 9.06e17 10_GOP After … Unkno… English 10/1/2017 3… ## # … with 243,881 more rows, and 15 more variables: harvested_date <chr>, ## # following <dbl>, followers <dbl>, updates <dbl>, post_type <chr>, ## # account_type <chr>, retweet <dbl>, account_category <chr>, ## # new_june_2018 <dbl>, alt_external_id <dbl>, tweet_id <dbl>, ## # article_url <chr>, tco1_step1 <chr>, tco2_step1 <chr>, ## # tco3_step1 <lgl> ``` --- # Data wrangling in **dplyr** 🧰 - How to handle extraneous variables? - Can I subset and rearrange my observations? - What is the easiest way to add new variables to my data set? - How can I use **dplyr** to gain quick insights about my data? .center[ ![](img/hex-dplyr.png) ] --- # How to handle extraneous variables? ```r df %>% * select(author, * content, * language, * publish_date, * post_type, * account_category) ``` ``` ## # A tibble: 243,891 x 6 ## author content language publish_date post_type account_category ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 10_GOP "\"We have a si… English 10/1/2017 1… <NA> RightTroll ## 2 10_GOP Marshawn Lynch … English 10/1/2017 2… <NA> RightTroll ## 3 10_GOP Daughter of fal… English 10/1/2017 2… RETWEET RightTroll ## 4 10_GOP JUST IN: Presid… English 10/1/2017 2… <NA> RightTroll ## 5 10_GOP 19,000 RESPECTI… English 10/1/2017 2… RETWEET RightTroll ## 6 10_GOP "Dan Bongino: \… English 10/1/2017 2… <NA> RightTroll ## 7 10_GOP 🐝🐝🐝 https://t.c… English 10/1/2017 2… RETWEET RightTroll ## 8 10_GOP '@SenatorMenend… English 10/1/2017 2… <NA> RightTroll ## 9 10_GOP As much as I ha… English 10/1/2017 3… <NA> RightTroll ## 10 10_GOP After the 'geno… English 10/1/2017 3… <NA> RightTroll ## # … with 243,881 more rows ``` --- # Can I subset and rearrange my observations? ```r df %>% select(author, content, language, publish_date, post_type, account_category) %>% * filter(post_type == "RETWEET" & language == "English") %>% * arrange(publish_date) ``` ``` ## # A tibble: 7 x 6 ## author content language publish_date post_type account_category ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 ALECMO… "Pusha T's firs… English 1/1/2016 18… RETWEET LeftTroll ## 2 ANTONH… None you weirdo… English 1/1/2016 18… RETWEET LeftTroll ## 3 ANTONH… James Surowieck… English 1/1/2016 18… RETWEET LeftTroll ## 4 ADRGRE… Gospel music--s… English 1/1/2016 18… RETWEET LeftTroll ## 5 ADRGRE… If Kim Kardashi… English 1/1/2016 18… RETWEET LeftTroll ## 6 ADRGRE… My best RTs thi… English 1/1/2016 18… RETWEET LeftTroll ## 7 AMELIE… "Join us LIVE o… English 1/1/2017 0:… RETWEET RightTroll ``` --- # What is the easiest way to add new variables to my data set? ```r df %>% select(author, content, language, publish_date, post_type, account_category) %>% filter(post_type == "RETWEET" & language == "English") %>% * mutate(political = if_else(account_category == "LeftTroll" | * account_category == "RightTroll", * "Political", "Not Political")) ``` ``` ## # A tibble: 7 x 7 ## author content language publish_date post_type account_category political ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 10_GOP Daught… English 10/1/2017 2… RETWEET RightTroll Political ## 2 10_GOP 19,000… English 10/1/2017 2… RETWEET RightTroll Political ## 3 10_GOP 🐝🐝🐝 ht… English 10/1/2017 2… RETWEET RightTroll Political ## 4 10_GOP BREAKI… English 10/11/2017 … RETWEET RightTroll Political ## 5 10_GOP Becaus… English 10/11/2017 … RETWEET RightTroll Political ## 6 10_GOP I am a… English 10/11/2017 … RETWEET RightTroll Political ## 7 10_GOP Do you… English 10/12/2017 … RETWEET RightTroll Political ``` --- # How can I use **dplyr** to gain quick insights about my data? ```r df %>% select(author, content, language, publish_date, post_type, account_category) %>% filter(post_type == "RETWEET" & language == "English") %>% mutate(political = if_else(account_category == "LeftTroll" | account_category == "RightTroll", "Political", "Not Political")) %>% * group_by(political) %>% * summarise(political_volume=n()) ``` ``` ## # A tibble: 2 x 2 ## political political_volume ## <chr> <int> ## 1 Not Political 19969 ## 2 Political 82060 ``` --- # How can I use **dplyr** to gain quick insights about my data? ```r df %>% select(author, content, language, publish_date, post_type, account_category) %>% * filter(language == "English") %>% mutate(political = if_else(account_category == "LeftTroll" | account_category == "RightTroll", "Political", "Not Political")) %>% group_by(political) %>% summarise(political_volume=n()) ``` ``` ## # A tibble: 2 x 2 ## political political_volume ## <chr> <int> ## 1 Not Political 41423 ## 2 Political 148829 ``` --- # Working with Text in stringr 🔤 - Why should I care about handing casing? - How do I determine the length of string? - What if I want to manipulate a string? - Can I find text patterns? - How do I extract patterns from strings? .center[ ![](img/hex-stringr.png) ] --- # Why should I care about handing casing? ```r string1 <- "Chris loves stringr" string2 <- "Chris loves Stringr" string1 == string2 ``` ``` ## [1] FALSE ``` ```r string1 <- str_to_upper(string1) string2 <- str_to_upper(string2) string1 == string2 ``` ``` ## [1] TRUE ``` --- # How do I determine the lenght of string? ```r df %>% select(content) %>% top_n(1) %>% * str_length() ``` ``` ## Selecting by content ``` ``` ## [1] 49 ``` ```r df %>% select(author, content, publish_date, post_type, account_category) %>% * mutate(content_length = str_length(content)) ``` ``` ## # A tibble: 5 x 6 ## author content publish_date post_type account_category content_length ## <chr> <chr> <chr> <chr> <chr> <int> ## 1 10_GOP "\"We have… 10/1/2017 1… <NA> RightTroll 156 ## 2 10_GOP Marshawn L… 10/1/2017 2… <NA> RightTroll 140 ## 3 10_GOP Daughter o… 10/1/2017 2… RETWEET RightTroll 143 ## 4 10_GOP JUST IN: P… 10/1/2017 2… <NA> RightTroll 145 ## 5 10_GOP 19,000 RES… 10/1/2017 2… RETWEET RightTroll 83 ``` --- # How do I determine the lenght of string? ```r df %>% select(author, content, publish_date, post_type, account_category) %>% * mutate(content_length = str_length(content)) %>% group_by(account_category) %>% * summarise(average_tweet_len = mean(content_length), * max_tweet_len = max(content_length), * min_tweet_len = min(content_length)) ``` ``` ## # A tibble: 8 x 4 ## account_category average_tweet_len max_tweet_len min_tweet_len ## <chr> <dbl> <int> <int> ## 1 Commercial 91.1 164 23 ## 2 Fearmonger 78.3 151 9 ## 3 HashtagGamer 73.9 168 3 ## 4 LeftTroll 103. 778 1 ## 5 NewsFeed 106. 163 32 ## 6 NonEnglish 100. 250 4 ## 7 RightTroll 115. 816 4 ## 8 Unknown 74.0 164 8 ``` --- # What if I want to manipulate a string? ```r df %>% select(author, content, publish_date, post_type, account_category) %>% * mutate(handles = str_c("@", author), content_length = str_length(content)) ``` ``` ## # A tibble: 243,891 x 7 ## author content publish_date post_type account_category handles ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 10_GOP "\"We … 10/1/2017 1… <NA> RightTroll @10_GOP ## 2 10_GOP Marsha… 10/1/2017 2… <NA> RightTroll @10_GOP ## 3 10_GOP Daught… 10/1/2017 2… RETWEET RightTroll @10_GOP ## 4 10_GOP JUST I… 10/1/2017 2… <NA> RightTroll @10_GOP ## 5 10_GOP 19,000… 10/1/2017 2… RETWEET RightTroll @10_GOP ## 6 10_GOP "Dan B… 10/1/2017 2… <NA> RightTroll @10_GOP ## 7 10_GOP 🐝🐝🐝 ht… 10/1/2017 2… RETWEET RightTroll @10_GOP ## 8 10_GOP '@Sena… 10/1/2017 2… <NA> RightTroll @10_GOP ## 9 10_GOP As muc… 10/1/2017 3… <NA> RightTroll @10_GOP ## 10 10_GOP After … 10/1/2017 3… <NA> RightTroll @10_GOP ## # … with 243,881 more rows, and 1 more variable: content_length <int> ``` --- # What if I want to manipulate a string? ```r df %>% select(author, content, publish_date, post_type, account_category) %>% mutate(handles = str_c("@", author), content_length = str_length(content)) %>% * group_by(handles) %>% * summarise(average_tweet_len = mean(content_length), * max_tweet_len = max(content_length), * min_tweet_len = min(content_length)) ``` ``` ## # A tibble: 7 x 4 ## handles average_tweet_len max_tweet_len min_tweet_len ## <chr> <dbl> <int> <int> ## 1 @10_GOP 110. 172 20 ## 2 @1488REASONS 78.2 159 23 ## 3 @1D_NICOLE_ 59.9 134 13 ## 4 @1ERIK_LEE 109 121 97 ## 5 @1LORENAFAVA1 110. 182 34 ## 6 @2NDHALFONION 81 92 75 ## 7 @459JISALGE 129 129 129 ``` --- # Can I find text patterns? ```r df %>% filter(language == "English") %>% select(content) ``` ``` ## # A tibble: 10 x 1 ## content ## <chr> ## 1 "\"We have a sitting Democrat US Senator on trial for corruption and yo… ## 2 Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagg… ## 3 Daughter of fallen Navy Sailor delivers powerful monologue on anthem pr… ## 4 JUST IN: President Trump dedicates Presidents Cup golf tournament troph… ## 5 19,000 RESPECTING our National Anthem! #StandForOurAnthem🇺🇸 https://t.c… ## 6 "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exac… ## 7 🐝🐝🐝 https://t.co/MorL3AQW0z ## 8 '@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't repo… ## 9 As much as I hate promoting CNN article, here they are admitting EVERYT… ## 10 After the 'genocide' remark from San Juan Mayor the narrative has chang… ``` --- # Can I find text patterns? ```r df %>% select(author, content, publish_date, account_category) %>% mutate(handles = str_c("@", author), * has_mentions = str_detect(content, "@\\w+")) ``` ``` ## # A tibble: 243,891 x 6 ## author content publish_date account_category handles has_mentions ## <chr> <chr> <chr> <chr> <chr> <lgl> ## 1 10_GOP "\"We have a … 10/1/2017 1… RightTroll @10_GOP TRUE ## 2 10_GOP Marshawn Lync… 10/1/2017 2… RightTroll @10_GOP FALSE ## 3 10_GOP Daughter of f… 10/1/2017 2… RightTroll @10_GOP FALSE ## 4 10_GOP JUST IN: Pres… 10/1/2017 2… RightTroll @10_GOP FALSE ## 5 10_GOP 19,000 RESPEC… 10/1/2017 2… RightTroll @10_GOP FALSE ## 6 10_GOP "Dan Bongino:… 10/1/2017 2… RightTroll @10_GOP FALSE ## 7 10_GOP 🐝🐝🐝 https://t… 10/1/2017 2… RightTroll @10_GOP FALSE ## 8 10_GOP '@SenatorMene… 10/1/2017 2… RightTroll @10_GOP TRUE ## 9 10_GOP As much as I … 10/1/2017 3… RightTroll @10_GOP FALSE ## 10 10_GOP After the 'ge… 10/1/2017 3… RightTroll @10_GOP TRUE ## # … with 243,881 more rows ``` --- # Can I find text patterns? ```r df %>% select(author, content, publish_date, account_category) %>% mutate(handles = str_c("@", author), * has_mentions = str_detect(content, "@\\w+")) %>% * group_by(account_category) %>% * summarise(mentions = sum(has_mentions == TRUE), * no_mentions = sum(has_mentions == FALSE)) ``` ``` ## # A tibble: 8 x 3 ## account_category mentions no_mentions ## <chr> <int> <int> ## 1 Commercial 0 339 ## 2 Fearmonger 52 332 ## 3 HashtagGamer 3492 23857 ## 4 LeftTroll 10404 25668 ## 5 NewsFeed 4 11287 ## 6 NonEnglish 5035 48003 ## 7 RightTroll 15403 99407 ## 8 Unknown 49 559 ``` --- # How do I extract patterns from strings? ```r df %>% select(author, content, publish_date) %>% rename(handle = author, tweet = content) %>% * mutate(first_mention = str_extract(tweet, "@(\\w+)"), * all_mentions = str_extract_all(tweet, "@(\\w+)")) ``` ``` ## # A tibble: 243,891 x 5 ## handle tweet publish_date first_mention all_mentions ## <chr> <chr> <chr> <chr> <list> ## 1 10_GOP "\"We have a sitting De… 10/1/2017 19… @nedryun <chr [1]> ## 2 10_GOP Marshawn Lynch arrives … 10/1/2017 22… <NA> <chr [0]> ## 3 10_GOP Daughter of fallen Navy… 10/1/2017 22… <NA> <chr [0]> ## 4 10_GOP JUST IN: President Trum… 10/1/2017 23… <NA> <chr [0]> ## 5 10_GOP 19,000 RESPECTING our N… 10/1/2017 2:… <NA> <chr [0]> ## 6 10_GOP "Dan Bongino: \"Nobody … 10/1/2017 2:… <NA> <chr [0]> ## 7 10_GOP 🐝🐝🐝 https://t.co/MorL3A… 10/1/2017 2:… <NA> <chr [0]> ## 8 10_GOP '@SenatorMenendez @Carm… 10/1/2017 2:… @SenatorMene… <chr [2]> ## 9 10_GOP As much as I hate promo… 10/1/2017 3:… <NA> <chr [0]> ## 10 10_GOP After the 'genocide' re… 10/1/2017 3:… @CNN <chr [1]> ## # … with 243,881 more rows ``` --- # Tidying data with **tidyr** 🧹 - How do I handle list columns? <br> .center[ ![](img/hex-tidyr.png) ] --- # How do I handle list columns? ```r df %>% mutate(all_mentions = str_extract_all(content, "@(\\w+)")) %>% select(author, publish_date, all_mentions) ``` ``` ## # A tibble: 3 x 3 ## author publish_date all_mentions ## <chr> <chr> <list> ## 1 10_GOP 10/1/2017 19:58 <chr [1]> ## 2 10_GOP 10/1/2017 22:43 <chr [0]> ## 3 10_GOP 10/1/2017 22:50 <chr [0]> ``` ```r df %>% mutate(all_mentions = str_extract_all(content, "@(\\w+)")) %>% select(author, publish_date, all_mentions) %>% * unnest() ``` ``` ## # A tibble: 3 x 3 ## author publish_date all_mentions ## <chr> <chr> <chr> ## 1 10_GOP 10/1/2017 19:58 @nedryun ## 2 10_GOP 10/1/2017 2:52 @SenatorMenendez ## 3 10_GOP 10/1/2017 2:52 @CarmenYulinCruz ``` --- # How do I handle list columns? ```r library(igraph) g <- df %>% mutate(all_mentions = str_extract_all(content, "@(\\w+)"), author = str_c("@", author)) %>% select(author, all_mentions) %>% unnest() %>% * graph_from_data_frame() %>% * set.graph.attribute("density", edge_density(.)) %>% * set.graph.attribute("avg_degree", mean(degree(.))) %>% * set.graph.attribute("avg_clu_coef", transitivity(., "average")) ``` <table class="table table-striped table-condensed" style="margin-left: auto; margin-right: auto;"> <caption>Global Graph Metrics</caption> <thead> <tr> <th style="text-align:right;"> Density </th> <th style="text-align:right;"> Avg..Degree </th> <th style="text-align:right;"> Avg..Clustering.Coefficient </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4.739 </td> <td style="text-align:right;"> 4.739 </td> </tr> </tbody> </table> --- ## Parting Thoughs and Additional Resources <br> .center[ ![](img/ds_cycle.png) ] .center[ Happy R learning! 🙋 ]