An Introduction to Data Visualization

Introduction
Packages
Data
Scatter Plots
Line Charts
Bar Charts
Chord Diagrams
Summary
Interactive Visualizations
Conclusion and Other Resources

Introduction

As researchers and analysts, we utilize exploratory analysis on a regular basis. It is likely that you have turned to Excel or a “point-and-click” program to visualize your data in the past. As we’ve discussed in previous videos, however, using R to build out your workflows provides several advantages, such as automation and reproducibility.

In this quick tutorial, we will use some basic data visualization techniques in R to explore a data set pertaining to narcotics trafficking in a hypothetical country. Specifically, the problem we hope to understand involves a cartel, which we will call “El Cartel,” that traffics narcotics and uses the revenues to buy weapons and political power. We will assume this activity is contributing to violence in the hypothetical country.

From this topic we can ask several questions, such as the following:

Is there a correlation between El Cartel’s narcotics-related violence and state narcotics seizures? How about narcotics production and violence?
Has the cartel’s violence gotten worse over time?
How can we characterize the violence during the last year?
El Cartel is comprised of several sub-organizations. Which sub-organizations have been involved in violence? Is violence dominated by specific sub-organizations? Are those sub-organizations collaborating?

As you will see, the choice of a type of visualization depends on several factors, including the type of data with which you are working and what your purpose is (i.e., exploratory vs. explanatory).

From our “answers” to the questions above, we can develop hypotheses that we can test using more sophisticated techniques and statistical models. For demonstration purposes, however, we will keep things simple and limit ourselves to data exploration and some basic informative techniques. Thus, we will explore our data and then describe our results to a hypothetical audience.

Packages

We will leverage five packages in the tutorial. The functions listed in Table 1 are the primary functions we will use, but they are not an exhaustive list of the functions and arguments used below (or that each package offers). The goal of Table 1 is simply to provide you with a quick preview; that is, our brief descriptions do not “do justice” to these excellent packages, so we recommend you check out their websites.

*Table 1: Summary of Chapter Packages and Functions*
Package	Function	Short Description
ggplot2	`ggplot()` & `facet_wrap()`	A tidyverse ¹ package, based on “The Grammar of Graphics,” to create data visualizations. The functions listed in Column 2 allow us to create visualizations and multi-panel plots, respectively.²
dplyr	`mutate()`, `group_by()`, & `summarise()`	A tidyverse package for data manipulation. The functions listed in Column 2 allow us to create new and transform variables while maintaining existing ones, as well as split data into groups and obtain summary statistics on grouped data. ³
lubridate	`month()` & `as_date()`	A tidyverse package to work with date-time data in an intuitive way. The functions in Column 2 allow us to specify the type of date-time data with which we are working. Here we use base R’s `as.Date()` function; we could have used lubridate’s `as_date()`. ⁴
circlize	`chordDiagram()`	This package permits users to design circular data visualizations, such as chord diagrams. ⁵
plotly	`plot_ly()`	A package that allows users to create interactive, “publication-quality graphs”, including in online environments. ⁶

You will need to make sure these packages are installed (using install.packages()) before loading them with library().
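
The loading step itself is not shown in this tutorial, so here is a minimal sketch of one way to check for and attach the five packages from Table 1. It only reports what is missing rather than installing anything, and the object names `pkgs` and `missing_pkgs` are our own, chosen for illustration.

```r
# The five packages listed in Table 1.
pkgs <- c("ggplot2", "dplyr", "lubridate", "circlize", "plotly")

# Which of them are not yet installed on this machine?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  # Install these once with install.packages(missing_pkgs).
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available so their functions can be called.
for (pkg in setdiff(pkgs, missing_pkgs)) {
  library(pkg, character.only = TRUE)
}
```

Installation is a one-time step per machine, whereas the library() calls need to run in every new R session.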

Data

The data we will use here is not a real-world data set, but it mimics many real-world data sets of its kind. We gathered and structured these data from narcotics activities across a variety of contexts.

Narcotics Events (i.e., Narco_Events.csv): This data set contains records of El Cartel-inflicted violence in the hypothetical country. It contains five variables:
- Event_ID: A unique identifier for each event.
- Date: The date on which the event occurred.
- Sub_Organization: The perpetrating sub-unit of an event.
- Event_Type: The type of an event that occurred.
- Casualties: The count of civilian casualties as a result of the event.
Drugs (i.e., Drugs.csv): This data set provides descriptive information about 29 hypothetical locations in which El Cartel maintains some level of influence. It contains 8 variables:
- Location: The name of a location.
- Opium_Cultivation_Hectares: The estimated amount of opium cultivation in each location (measured in hectares).
- Coca_Cultivation_Hectares: The estimated amount of coca cultivation in each location (measured in hectares).
- Opium_Production_Tons: The estimated amount of opium production in each location (measured in tons).
- Coca_Production_Tons: The estimated amount of coca production in each location (measured in tons).
- Violent_Events: The count of El Cartel-initiated violent events in each location.
- Seizures: The count of state narcotics seizures in each location.
- Active_Militia: A binary variable indicating the presence (i.e., “Yes”) or absence (i.e., “No”) of active, pro-government militias in each location.
Cartel Financial Flows (i.e., El_Cartel_Net_Mat.csv): This data set contains financial flows among El Cartel’s sub-units. The values in this matrix’s cells are in U.S. dollars.

As you may have noticed in other tutorials, several ways exist to import your data. In this example, we will import our data using read.csv() and create a set of data frames.

events_df <- as.data.frame(read.csv(file="data/Narco_Events.csv", header=TRUE))
drugs_df <- as.data.frame(read.csv(file="data/Drugs.csv", header=TRUE))
el_cartel <- read.csv(file = "data/El_Cartel_Net_Mat.csv", row.names = 1)
# Our initial file is stored as a matrix for a chord diagram. We have to ensure
# R reads it as a matrix, so we do an extra step here during import.
# We will come back to this.
el_cartel_mat <- as.matrix(el_cartel)
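
To see why the extra as.matrix() step matters, the sketch below imports a tiny, made-up flow matrix from CSV text. The three sub-units and their values are hypothetical and are not taken from the tutorial's file.

```r
# Hypothetical three-unit flow matrix written as CSV text (not the tutorial's file).
csv_lines <- c(
  ",S1,S2,S3",
  "S1,0,100,0",
  "S2,50,0,25",
  "S3,0,75,0"
)

# row.names = 1 moves the first column into the row names instead of keeping it as data.
toy <- read.csv(text = csv_lines, row.names = 1)
class(toy)            # "data.frame"

# as.matrix() converts the all-numeric data frame into a numeric matrix.
toy_mat <- as.matrix(toy)
is.matrix(toy_mat)    # TRUE
toy_mat["S2", "S1"]   # 50
```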

Scatter Plots

The first set of questions we hope to explore is: “Is there a correlation between El Cartel’s narcotics-related violence and state narcotics seizures?” and “How about narcotics production and violence?”

Based on these questions, we will utilize a series of scatter plots. These graphs are used to show one quantitative variable relative to another. In the “Drugs” (i.e., drugs_df) data set, we have a series of variables, two of which pertain to our first question: Violent_Events and Seizures.

Based on Healy’s (2019) excellent text, we will build our scatter plot using the following steps:

Tell ggplot2 the data set from which you want to build your scatter plot using the ggplot() function and its data argument (note: the “sp1” stands for “scatter plot 1”):

sp1<-ggplot(data = drugs_df)

Tell ggplot2 what variables you want to include by adding the mapping argument/function:

sp1<-ggplot(data = drugs_df,
            mapping = aes (x = Violent_Events,
                           y = Seizures))

Next, add a plot layer (i.e., tell ggplot2 what kind of plot you want) to our sp1 object, which in this case is geom_point() because we want to visualize a scatter plot:

sp1 + geom_point()

Figure 1: Scatter Plot of Seizures and Violence

Finally, we can recolor the points to blue, add a title and relabel the axes (labs()), and add a smoothed line (geom_smooth()), which by default also draws a confidence band around the line.

sp1 + geom_point (color = "blue") + 
  geom_smooth() + # Note other options exist for `geom_smooth`, such as a linear model. 
  labs (x = " # of El Cartel Violent Events", y = "# of State Seizures",
        title = "Cartel Violence and State Drug Seizures By State (2018)",
        caption = "Hypothetical Open Source Data")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 2: Scatter Plot of Seizures and Violence
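
The code comment above notes that geom_smooth() accepts other smoothers, such as a linear model. A minimal sketch of that option follows. It uses a small made-up data frame (`toy_df`) instead of drugs_df so that it runs on its own, and the values are purely illustrative.

```r
library(ggplot2)

# Hypothetical stand-in for drugs_df with the two columns used in the text.
toy_df <- data.frame(
  Violent_Events = c(2, 5, 8, 12, 15, 20, 24, 30),
  Seizures       = c(1, 4, 6, 9, 14, 15, 21, 26)
)

sp_lm <- ggplot(data = toy_df,
                mapping = aes(x = Violent_Events, y = Seizures)) +
  geom_point(color = "blue") +
  # method = "lm" fits a straight line; se = TRUE (the default) draws the
  # confidence band around it.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

sp_lm
```

A straight line is easier to describe to an audience, while the default loess smoother is often more useful during exploration because it can reveal curvature.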

It is often the case that our exploratory analyses lead us to more exploratory analysis. Say, for example, we wanted to continue to look at the relationship between Violent_Events and Seizures, but this time based on whether there are active pro-government militias in the location.

By adding mapping = aes(color = Active_Militia) to geom_point(), we can color each location by militia presence and maintain a single smoothed line. Note that a legend appears once we include this new mapping.

sp1 + geom_point (mapping = aes(color = Active_Militia)) + 
  geom_smooth() + # Note other options exist for `geom_smooth`, such as a linear model. 
  labs (x = " # of El Cartel Violent Events", y = "# of State Seizures",
        title = "Cartel Violence and State Drug Seizures By State (2018)",
        caption = "Hypothetical Open Source Data")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 3: Scatter Plot of Seizures and Violence with Active Militias

Note you can save each visualization using ggsave(), which by default saves the last plot you displayed.
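
A minimal sketch of saving a plot with ggsave() follows. The data frame, the file name, and the plot dimensions are all assumptions made for illustration; in the tutorial you would pass your own scatter plot object instead.

```r
library(ggplot2)

# Hypothetical stand-in data; in the tutorial this would be the militia scatter plot.
toy_df <- data.frame(
  Violent_Events = c(2, 5, 8, 12),
  Seizures       = c(1, 4, 6, 9),
  Active_Militia = c("Yes", "No", "Yes", "No")
)

p <- ggplot(toy_df, aes(x = Violent_Events, y = Seizures)) +
  geom_point(aes(color = Active_Militia))

# The file name is an assumption; the extension (.pdf here) selects the graphics device.
out_file <- file.path(tempdir(), "scatter_militia.pdf")

# Passing plot = p is explicit; if omitted, ggsave() saves the last plot displayed.
ggsave(filename = out_file, plot = p, width = 6, height = 4)
```

Naming the plot object explicitly is a little more typing, but it avoids saving the wrong figure when several plots have been drawn in a session.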

We can go one step further and add a quantitative variable, such as “Opium_Production_Tons”, alongside our binary variable (i.e., Active_Militia). This approach gives us a bubble chart. This visualization helps us see the correlation between violence and seizures and adds a dimension for opium production. The disadvantage here is that we are getting close to making, if we have not already made, the common mistake of including too many dimensions.

To build a bubble chart, you simply need to map a variable to size within aes().

We are not starting from scratch here; we are building upon our existing scatter plot arguments. Had we started from scratch, we would need to make sure that we tell ggplot2 which data set we want to use, the variables to map (i.e., mapping = aes()), and a plot layer, as well as any aesthetic changes we would want to make.

ggplot(data = drugs_df, aes (x = Violent_Events, y = Seizures, 
                             size = Opium_Production_Tons,
                             color = Active_Militia)) + 
  geom_point() +
  labs (x = " # of El Cartel Violent Events", y = "# of State Seizures",
        title = 'Cartel Violence, Drug Seizures, and Militia Presence (2018)',
        caption = "Hypothetical Open Source Data")

Figure 4: Bubble Chart of Seizures, Violence, and Militia Presence

Additionally, we could show multiple scatter plots (i.e., a scatter plot matrix) to see if we have interesting correlations between multiple variables. First, let’s focus on our quantitative variables, which are everything except the Location and Active_Militia columns (i.e., columns 2 through 7 in our data frame).

mat_sp_data<-drugs_df[ , c(2:7)] # Here we are calling the extracted columns "mat_sp_data", which stands for "matrix scatter plot data."

plot(mat_sp_data, pch=18, cex=1.25, col="blue") # Now, we can visualize the matrix using this simple plot function and adjust the point symbol, size, and color using the `pch`, `cex`, and `col` arguments.

Figure 5: Scatterplot Matrix

You can take a look at the actual correlations using the cor function. Note we do not actually show them here.

cm<- cor(mat_sp_data)
round(cm, 2) # Let's round our values to 2 decimal places.

You can, however, look at the correlation coefficient only between Violent Events and Seizures using the short script below. We can see the correlation between the two is about .80, which is a strong, positive correlation by many standards.

cor(mat_sp_data$Violent_Events, mat_sp_data$Seizures)

## [1] 0.7967116
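
By default, cor() returns Pearson's correlation coefficient. As a quick illustration of what that number represents, the sketch below computes it by hand on two made-up vectors and checks the result against cor(). The values are hypothetical and unrelated to the tutorial's data.

```r
# Hypothetical values, not the tutorial's data.
violent_events <- c(2, 5, 8, 12, 15)
seizures       <- c(1, 4, 6, 9, 14)

# Pearson's r: covariance of the two variables divided by the product of
# their standard deviations.
r_by_hand <- cov(violent_events, seizures) /
  (sd(violent_events) * sd(seizures))

# cor() uses method = "pearson" unless told otherwise, so the two agree.
all.equal(r_by_hand, cor(violent_events, seizures))  # TRUE
```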

Line Charts

Our second question is, “has the cartel’s violence gotten worse over time?”

This question lends itself to a time series visualization. We can begin with a simple visualization to look at the impact of El Cartel’s violence during 2018. Specifically, we will look at violence by month. So, the first thing we will do is make sure that R recognizes that our “Date” column actually contains dates.

class(events_df$Date) # Note "Date" is currently a factor (in R 4.0.0 or later, read.csv() imports text as "character" by default).

## [1] "factor"

To fix this, use the mutate() function and tell R your “Date” column does in fact consist of dates.

events_df <- events_df %>%
  mutate(Date = as.Date(Date))# Now we see it's a date if we do class(events_df$Date) again.

Next, add a month column to “events_df” using lubridate’s month() function.

events_df <- events_df %>%
  mutate(month = month(Date))

Let’s now group our data by month using the group_by() function as well as the summarise() function to add up all the casualties for each month. Note we created a new data frame called “events_df_month.”

events_df_month <- events_df %>%
  group_by(month) %>%
  summarise(sum_casualties = sum(Casualties))
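
To see what these steps do on a small scale, here is a sketch that runs the same pipeline on five made-up events. The data frame `toy_events` and its values are hypothetical; only the function calls mirror the text.

```r
library(dplyr)
library(lubridate)

# Hypothetical events; dates are stored as text, as they would be after read.csv().
toy_events <- data.frame(
  Date       = c("2018-01-05", "2018-01-20", "2018-02-11", "2018-02-14", "2018-03-02"),
  Casualties = c(3, 1, 4, 2, 5)
)

toy_month <- toy_events %>%
  mutate(Date = as.Date(Date)) %>%   # text -> Date
  mutate(month = month(Date)) %>%    # extract the month number (1-12)
  group_by(month) %>%                # one group per month
  summarise(sum_casualties = sum(Casualties))

toy_month
# month 1 -> 4 casualties, month 2 -> 6, month 3 -> 5
```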

We can now visualize the number of casualties by month. Notice most of the code for a line chart is similar to what we did for our scatter plots. The important difference is that we are now going to use the geom_line() function instead of the geom_point() function.

lc1<-ggplot(data = events_df_month,
            mapping = aes(x = month,
                          y = sum_casualties))

lc1 + geom_line() +
    labs (x = " Month", y = "# Casualties",
        title = "Cartel Violence and Casualties (2018)",
        caption = "Hypothetical Open Source Data")

Figure 6: Line Chart of Cartel Violence and Casualties

Figure 6 depicts the broader temporal trend of violence in 2018. What the previous visualization does not tell us is how each type of event varied over time. For instance, we may want to compare the five types of cartel-initiated events over 2018. We will use the group_by() and summarise() functions again.

events_df_month2 <- events_df %>% #We created new data frame so we don't overwrite the old one.
  group_by(month, Event_Type)%>%
  summarise(sum_casualties = sum(Casualties))

Now we can visualize our line chart.

lc2<-ggplot(data = events_df_month2,
            mapping = aes(x = month,
                          y = sum_casualties))

lc2 + geom_line(aes(color = Event_Type)) +
  scale_x_continuous(breaks = seq(2, 12, by = 2)) + # month is numeric, so we set tick marks on a continuous scale
  labs (x = " Month", y = "# Casualties",
        title = "Cartel Violence and Casualties by Event Type (2018)",
        caption = "Hypothetical Open Source Data")

Figure 7: Line Chart of Cartel Violence and Casualties by Event Type

Now at first glance this looks fine; however, line charts often continue trends between points in time even though there may not have been any observations. For instance, we don’t have any kidnappings from April through November, but the line shows a consistent pattern. This example demonstrates the difference between continuous and discrete time series data.
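
One way to keep a line from bridging months with no recorded events is to make those months explicit as zeros before plotting. The sketch below does this in base R with expand.grid() and merge() on a small hypothetical summary table. This is our own illustration and not a step in the analysis above; treating an absent month as zero casualties is an assumption you should check against how your data were collected.

```r
# Hypothetical monthly summary: kidnappings are recorded only in months 1 and 12.
toy_summary <- data.frame(
  month          = c(1L, 12L, 1L, 2L, 3L),
  Event_Type     = c("Kidnapping", "Kidnapping", "Small Arms", "Small Arms", "Small Arms"),
  sum_casualties = c(2, 3, 5, 4, 6)
)

# Every month/event-type combination that could appear on the chart.
all_combos <- expand.grid(
  month      = 1:12,
  Event_Type = unique(toy_summary$Event_Type),
  stringsAsFactors = FALSE
)

# Left join: combinations with no recorded events get NA, which we then set to 0.
toy_full <- merge(all_combos, toy_summary, by = c("month", "Event_Type"), all.x = TRUE)
toy_full$sum_casualties[is.na(toy_full$sum_casualties)] <- 0

nrow(toy_full)  # 24 rows: 12 months x 2 event types
```

Plotting the zero-filled table with geom_line() would drop the kidnapping line to zero between January and December instead of drawing a straight segment across the gap.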

An alternative is a bar plot showing trends for each type of event separately. Note the stat = "identity" argument tells geom_bar() to plot the casualty totals we supply on the y-axis rather than counting rows, and the scale_x_continuous() function allows us to set the x-axis tick values. We will include the facet_wrap() function as well to help us create distinct bar plots for each type of event across the year.

bp1 <- ggplot(data = events_df_month2,
            mapping = aes(x = month, y = sum_casualties, color = Event_Type))
bp1 + geom_bar(stat = "identity") + 
  scale_x_continuous(breaks = seq(2, 12, by = 2)) + 
  facet_wrap(~Event_Type, ncol = 2)

Figure 8: Temporal View of El Cartel’s Violence by Type

Regardless, it remains clear that El Cartel’s violence has increased sharply since October of last year, most notably in terms of Drug-Related Activity and Small Arms-based events.

Bar Charts

When we have a set of categories and we are interested in depicting a quantitative value (i.e., amount) for each category, we can turn to bar plots.⁷

We can see from the line charts that narcotics-related casualties have increased substantially over the last quarter of last year, which could lead us to the following question, “which sub-organizations are the major perpetrators of such violence?”

We can use bar charts to look at how much violence (as measured by casualties) each sub-organization perpetrated in 2018. For this section, we will switch back to our initial events data frame (i.e., “events_df”). You can run the following code in your console if you want to re-familiarize yourself with the data.

head(events_df)

Let’s also create a new data frame using dplyr’s functions. What we want for the next few visualizations are key statistics for violence per group and the types of violence in which each sub-organization has participated. Again, we will turn to the group_by() and summarise() functions.

events_df_group <- events_df %>%
  group_by(Sub_Organization, Event_Type, month)%>%
  summarise(sum_casualties = sum(Casualties))

First, let’s look at each sub-organization’s impact in terms of casualties. To build a bar chart/plot in ggplot2, you can use the following code⁸:

bp2 <- ggplot(data = events_df_group,
            mapping = aes(x = Sub_Organization, y = sum_casualties, fill = Sub_Organization))
bp2 + geom_bar(stat = "identity") + # Note stat = "identity" plots the y values we supply instead of counting rows
    labs (title = "El Cartel's Sub-Org Violence by Casualties (2018)",
        caption = "Hypothetical Open Source Data")

Figure 9: Bar plot of El Cartel’s Violence by Sub-Organization
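
Footnote 8 points to forcats for reordering bars by value. As a rough base R alternative (our substitution, not the approach the footnote names), stats::reorder() can order the categories by their total casualties. The data frame below is hypothetical.

```r
# Hypothetical grouped data; S1 and S3 each appear in two rows.
toy_group <- data.frame(
  Sub_Organization = c("S1", "S2", "S3", "S1", "S3"),
  sum_casualties   = c(2, 10, 4, 1, 9)
)

# Order the levels by total casualties, largest first (hence the minus sign).
toy_group$Sub_Organization <- reorder(
  toy_group$Sub_Organization,
  -toy_group$sum_casualties,
  FUN = sum
)

levels(toy_group$Sub_Organization)  # "S3" "S2" "S1"
```

Because ggplot2 draws a discrete axis in factor-level order, a bar plot built from a data frame prepared this way would show its bars from the highest total to the lowest.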

As we can see in the bar plot, Sub-Organizations 3, 4, and 5 are the most prolific offenders (the colors are a bit distracting, but we will keep them for now to keep things simple). Let’s also build a stacked bar plot to see the types of violence through which each sub-organization inflicted casualties. The key difference in the script here is that we replaced fill = Sub_Organization with fill = Event_Type.

bp3 <- ggplot(data = events_df_group,
            mapping = aes(x = Sub_Organization, y = sum_casualties, fill = Event_Type))
bp3 + geom_bar(stat = "identity") + # Note stat = "identity" plots the y values we supply instead of counting rows
    labs (title = "El Cartel's Sub-Org Violence by Casualties (2018)",
        caption = "Hypothetical Open Source Data")

Figure 10: Bar plot of El Cartel’s Violence by Sub-Organization

Similar to Figure 8, we can look at each sub-organization’s activity over time.

bp4 <- ggplot(data = events_df_group,
            mapping = aes(x = month, y = sum_casualties, fill = Sub_Organization)) # We'll use "fill" this time.
bp4 + geom_bar(stat = "identity") + 
  scale_x_continuous(breaks = seq(2, 12, by = 2)) + 
  facet_wrap(~Sub_Organization, ncol = 2)

Figure 11: Bar plot of El Cartel’s Violence by Sub-Organization

From these visualizations, we can see that three prominent sub-organizations (i.e., sub-organizations 3-5) were involved in the most common types of violence, namely drug-related activity and small arms attacks. Furthermore, we can see increases in casualties caused by all three prominent sub-organizations during the last quarter of the year.

Chord Diagrams

We just examined a question about which sub-organizations were involved in various types of violence. Our remaining question is whether those sub-organizations are collaborating. Chord diagrams (and Sankey diagrams as well) are good for showing relations among entities, especially weighted relations (or flows). Certainly, we could show a sociogram, but we will cover that more in the social network analysis (SNA) section of this program.

To visualize a chord diagram, we use circlize’s chordDiagram() function. Note I’ve shortened the labels of the sub-organizations in the underlying data set (i.e., the “el_cartel_mat” we imported at the beginning of the tutorial) prior to import to make the visualization more appealing.

chordDiagram(el_cartel_mat)

Figure 12: Chord Diagram of Financial Flows

We can adjust some of the aesthetics to make this even more appealing. For instance, we can reorder the sectors (i.e., sub-organizations’ names) using the order argument.⁹

chordDiagram(el_cartel_mat, order = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"))# We did not set the colors so your output colors may look different.

Figure 13: Chord Diagram of Financial Flows

We may want to emphasize Sub-organizations 3-5 because they were so prominent in El Cartel’s violence. Here we will color them blue, red, and green, respectively (i.e., “grid.col”), and color the remaining sub-organizations grey. We will also adjust the link transparency and maintain the order we established in the last visualization.

grid.col<-c(S1 = "grey", S2= "grey", S3 = "blue", S4 = "red", S5 = "green", S6 = "grey", S7 = "grey", S8 = "grey")
chordDiagram(el_cartel_mat, grid.col = grid.col, transparency = 0.25, order = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"))

Figure 14: Chord Diagram of Financial Flows

From this visualization we can see that Sub-organizations 3-5 are involved in several collaborative relationships with one another and the other sub-organizations. The thickness of the links suggests they are sharing substantial finances to operate.
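
To put rough numbers behind what the link widths show, you can total each sub-organization's outgoing and incoming flows. The sketch below uses a hypothetical three-unit matrix and assumes that rows are senders and columns are receivers, which is our understanding of how circlize reads matrix input.

```r
# Hypothetical flows in U.S. dollars; rows send, columns receive.
toy_mat <- matrix(
  c(  0, 100,  0,
     50,   0, 25,
      0,  75,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("S1", "S2", "S3"))
)

outflow <- rowSums(toy_mat)  # total sent by each unit
inflow  <- colSums(toy_mat)  # total received by each unit

# Sort by combined volume to see which units dominate the diagram.
sort(outflow + inflow, decreasing = TRUE)
```

Applied to el_cartel_mat, the same two calls would give a simple numeric check on which sub-organizations account for the thickest links.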

Summary

Let’s now summarize the answers to our questions so we can move into presentation mode:

Is there a correlation between El Cartel’s narcotics-related violence and state narcotics seizures? How about narcotics production and violence?
- Yes, we see a strong, positive relationship between El Cartel-led violence and narcotics seizures (about .80). We cannot, however, infer any causal relationship from the type of techniques we’ve used.
Has the cartel’s violence gotten worse over time? How can we characterize the violence during the last year?
- El Cartel’s violence increased sharply in the last quarter of 2018, most notably in terms of Drug-Related Activity and Small Arms-based events.
El Cartel is comprised of several sub-organizations. Which sub-organizations have been involved in violence? Is violence dominated by specific sub-organizations?
- The three most prominent sub-organizations are Sub-organizations 3-5, all of which were heavily involved in drug-related activity and small arms attacks.
- All three prominent sub-organizations inflicted an increasing number of casualties during the last quarter of 2018.
How are those sub-organizations collaborating?
- Sub-organizations 3-5 are involved in several collaborative relationships with one another and the other sub-organizations.

Interactive Visualizations

Once you’ve explored your data, the next step is to communicate the results to your audience (or move on to confirmatory approaches). With R you have many options to produce your reports (e.g., Markdown, which is what you’re looking at) ¹⁰, briefs (e.g., Reveal JS ¹¹ and Xaringan ¹²), and/or interactive tools/dashboards (e.g., R Shiny ¹³ and flexdashboard ¹⁴). An in-depth tutorial of these options is beyond the scope of this write-up, but we highly recommend that you explore these options as you become more comfortable with R.

One package that works with all of these options is plotly. Though interactivity can distract from substance in some cases, we think it has many advantages generally speaking, such as the ability of consumers/audiences to explore data on their own.

As with the other packages we’ve used so far, we will keep things simple and build only a few of the same visualizations from above. This time, however, we will include some interactivity in our visualizations.

Building a scatter plot to address our first question is pretty straightforward.

plot_ly(data = drugs_df, x = ~Violent_Events, y = ~Seizures)

Figure 15: Interactive Scatter Plot 1
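
When no trace type is given, plot_ly() picks one for you and prints a message saying so. A minimal sketch that sets the type and mode explicitly follows. The data frame is a hypothetical stand-in for drugs_df.

```r
library(plotly)

# Hypothetical stand-in for drugs_df with the two columns used in the text.
toy_df <- data.frame(
  Violent_Events = c(2, 5, 8, 12),
  Seizures       = c(1, 4, 6, 9)
)

# type = "scatter" with mode = "markers" draws points only (no connecting lines).
p <- plot_ly(data = toy_df, x = ~Violent_Events, y = ~Seizures,
             type = "scatter", mode = "markers")
```

Typing p at the console (or in an R Markdown chunk) displays the interactive plot.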

We can add a few more aesthetic properties with the following:

plot_ly(data = drugs_df, x = ~Violent_Events, y = ~Seizures,
        color = ~Active_Militia, marker = list(size = 12)) %>%
  layout(title = 'Cartel Violence, Drug Seizures, and Militia Presence (2018)',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

Figure 16: Interactive Scatter Plot 2

We can replicate our bubble charts from earlier as well:

plot_ly(data = drugs_df, x = ~Violent_Events, y = ~Seizures,
        color = ~Active_Militia, size = ~Opium_Production_Tons,
        text = ~paste("Location: ", Location)) %>%
  layout(title = 'Cartel Violence, Drug Seizures, and Militia Presence (2018)',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

Figure 17: Interactive Bubble Chart 1

Conclusion and Other Resources

Remember, this tutorial is very basic and designed to get you interested in using R for data visualizations. Many useful resources exist that go into far more depth than this document. Here are a few recent resources on data visualization in R to check out (many other great ones exist):

Books:

Chang, Winston. 2019. R Graphics Cookbook: Practical Recipes for Visualizing Data (2nd Edition). Sebastopol, CA: O’Reilly Media, Inc.
Healy, Kieran. 2019. Data Visualization: A Practical Introduction. Princeton, NJ: Princeton University Press.
Wilke, Claus O. 2019. Fundamentals of Data Visualization. Sebastopol, CA: O’Reilly Media, Inc.

Websites:

The R Graph Gallery, https://www.r-graph-gallery.com/.
DataCamp’s Data Visualization in R page, https://www.datacamp.com/courses/data-visualization-in-r.

References:

1. See tidyverse’s website, https://www.tidyverse.org/.
2. See ggplot2 at tidyverse’s website, https://ggplot2.tidyverse.org/.
3. See dplyr at tidyverse’s website, https://dplyr.tidyverse.org/.
4. See lubridate at tidyverse’s website, https://lubridate.tidyverse.org/.
5. https://cran.r-project.org/web/packages/circlize/index.html.
6. https://plot.ly/r/.
7. It is useful to depict proportions using bar plots but for demonstration purposes we will stick with raw casualty counts.
8. See the forcats package (https://forcats.tidyverse.org/package) for reordering bars based on values.
9. We’ve shortened the sub-organizations’ names for demonstration purposes.
10. https://rmarkdown.rstudio.com/.
11. https://cran.r-project.org/web/packages/revealjs/index.html.
12. https://github.com/yihui/xaringan.
13. https://shiny.rstudio.com/.
14. https://rmarkdown.rstudio.com/flexdashboard/.