Introduction

The goal of this lesson is to provide you with foundational skills for handling spatial data. You’ll learn what vector and raster data are, how to represent and manipulate spatial geometries, build static and interactive maps, and work with different coordinate systems.

We’ll finish the lesson with a case study using real-world SIGACTs data to analyze IEDs in Kabul.

Setting Up

  1. Start with a clean R workspace.
    • If you’re using RStudio, you can click Session and Restart R, or just press Ctrl + Shift + F10.
  2. Check and install any missing packages you’ll need for this lesson.
  1. Load the main packages we’ll use.
  1. Set default plotting theme.

Peak Ahead

In this lesson, we’ll be using functions from many packages, but here’s a glimpse at some of the spatial functions we’ll be using.

Function Description
sf::st_as_sf() Convert objects to sf objects.
sf::st_transform() Convert between coordinate systems.
sf::st_buffer() Buffer spatial geometry distances.
sf::st_crop() Crop spatial data.
sf::st_join() Join separate spatial data.
ggplot2::geom_sf() Use ggplot2 to visualize sf geometries.
leaflet::leaflet() Start a leaflet web map.
leaflet::addPolygons() Add polygon geometries to a leaflet web map.
leaflet::addPoints() Add point geometries to a leaflet web map.


Spatial Data

What do we mean when we say “spatial data”?

When we say spatial data, we’re referring to information that describes the location of objects and events. Spatial data answers the question "Where, in space, is some phenomenon?".

Generally, spatial data falls into two categories: vector and raster.

Vector Data

Vector data represent the world using points, lines, and polygons. Spatial phenomena such as buildings (points), roads (lines), and the shapes of a bodies of water (polygons), are all things we would likely describe using vector data.

Show Non-Lesson Code

Raster Data

Raster data represent the world as a grid of individual cells, or pixels, all of which are the same size. Each pixel contains a numeric value that refers to some spatial phenomenon.

If you have ever looked at satellite imagery, you’ve looked at raster data, where the multiple grids are stacked, each containing the values of a visible color or other wavelength on the electromagnetic spectrum.

Show Non-Lesson Code

Consider the following image displaying elevation data. The value of any given pixel represents the elevation of the land it contains.

Show Non-Lesson Code

While the underlying data may be very different, we can also user vector raster data together.

Show Non-Lesson Code

geometry

Spatial vector data represent the world as a collection of points which, for two-dimensional data, are stored as \(x\) and \(y\) coordinates.

Points can be joined in order to make lines, which themselves can be joined to make polygons.

{sf}, the package emphasized here, refers to Simple Features, which is an ontology that represents spatial vector data in such a way that we can pack our spatial data into a data.frame column.

{sf} uses a different shape, or geometry, to represent points, lines, and polygons. Let’s take a tour of the seven geometries you’re most likely to encounter.

Before we get started, we’ll create a tibble data.frame consisting of coordinates, labels, and groups.

## # A tibble: 8 x 4
##     lng   lat lab   grp  
##   <dbl> <dbl> <chr> <chr>
## 1   -20   -20 A     a    
## 2   -20    10 B     a    
## 3   -10    20 C     a    
## 4   -10   -10 D     a    
## 5    20   -10 E     b    
## 6    20    10 F     b    
## 7    10    20 G     b    
## 8    10   -20 H     b

POINT

POINT refers to the location of a single point in space.

Here, we use st_as_sf() to convert a regular data.frame into an sf object.

  • Steps:
    1. take practice_coords
    2. convert to sf object with st_as_sf(), providing a character vector indicating the names of practice_coords in \((x, y)\) / \((longitude, latitude)\) order
    3. mutate() to a add a column named shape, which we obtain using st_geometry_type().
## Simple feature collection with 8 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 8 x 4
##   lab   grp    geometry shape
## * <chr> <chr>   <POINT> <fct>
## 1 A     a     (-20 -20) POINT
## 2 B     a      (-20 10) POINT
## 3 C     a      (-10 20) POINT
## 4 D     a     (-10 -10) POINT
## 5 E     b      (20 -10) POINT
## 6 F     b       (20 10) POINT
## 7 G     b       (10 20) POINT
## 8 H     b      (10 -20) POINT

The data in our lng and lat columns are moved to a new geometry column.

MULTIPOINT

MULTIPOINT refers to a collection of POINTs.

  • Steps:
    1. take point_sf
    2. using group_by(), group the rows together based on the values in their grp column
    3. summarise() each group, which combines the points of each group into a MULTIPOINT
    4. mutate() the shape column to change it to the new st_geometry_type()
## Simple feature collection with 2 features and 2 fields
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 2 x 3
##   grp                             geometry shape     
## * <chr>                       <MULTIPOINT> <fct>     
## 1 a     (-20 -20, -20 10, -10 -10, -10 20) MULTIPOINT
## 2 b         (10 -20, 10 20, 20 -10, 20 10) MULTIPOINT

Instead of the 8 separate POINTs with which we started, we now have 2 rows of MULTIPOINTs, each of which contain 4 points.

LINESTRING

LINESTRING is how we represent individual lines.

  • Steps:
    1. take multi_point_sf
    2. cast the geometry to= LINESTRING using st_cast() 3 mutate() the shape column to change it to the new st_geometry_type()
## Simple feature collection with 2 features and 2 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 2 x 3
##   grp   shape                                geometry
## * <chr> <fct>                            <LINESTRING>
## 1 a     LINESTRING (-20 -20, -20 10, -10 -10, -10 20)
## 2 b     LINESTRING     (10 -20, 10 20, 20 -10, 20 10)

Now we have 2 rows that each contain a LINESTRING, which was built by connecting each point to the next.

MULTILINESTRING

Similar to MULTIPOINTs that contain multiple points, we also have MULTILINESTRINGs.

  • Steps:
    1. take linestring_sf
    2. summarise() the rows, combining them all into a single MULTILINESTRING
    3. mutate() the shape column to change it to the new st_geometry_type() and replace the grp column that is dropped when we summarise() without using group_by()
## Simple feature collection with 1 feature and 2 fields
## geometry type:  MULTILINESTRING
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 1 x 3
##                                                  geometry shape       grp  
## *                                       <MULTILINESTRING> <fct>       <chr>
## 1 ((-20 -20, -20 10, -10 -10, -10 20), (10 -20, 10 20, 2~ MULTILINES~ multi

Now we have 2 lines embedded inside a single MULTILINESTRING row.

POLYGON

POLYGONs are essentially sets of lines that close to form a ring, although POLGYONs can also contain holes. We can easily wrap a shape around any geometry using st_convex_hull() to form a convex hull polygon.

  • Steps:
    1. take point_sf
    2. using group_by(), group the rows together based on the values in their grp column
    3. summarise() each group, combining them into MULTIPOINTs
    4. wrap the MULTIPOINTs in a polygon using st_convex_hull()
    5. mutate() the shape column to change it to the new st_geometry_type()
## Simple feature collection with 2 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 2 x 3
##   grp                                        geometry shape  
## * <chr>                                     <POLYGON> <fct>  
## 1 a     ((-20 -20, -20 10, -10 20, -10 -10, -20 -20)) POLYGON
## 2 b          ((10 -20, 10 20, 20 10, 20 -10, 10 -20)) POLYGON

MULTIPOLYGON

POLYGONs can also be grouped together to form MULTIPOLYGONs.

  • Steps:
    1. take polygon_sf
    2. summarise() the rows, combining them all into a single MULTILIPOLYGON
    3. mutate() the shape column to change it to the new st_geometry_type() and replace the grp column that is dropped when we summarise() without using group_by()
## Simple feature collection with 1 feature and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 1 x 3
##                                                    geometry shape     grp  
## *                                            <MULTIPOLYGON> <fct>     <chr>
## 1 (((-20 -20, -20 10, -10 20, -10 -10, -20 -20)), ((10 -20~ MULTIPOL~ multi

GEOMETRY

GEOMETRY is a special geometry type. It refers to a column of mixed geometries, i.e. we have multiple geometry types in our geometry column.

Show Non-Lesson Code

## Simple feature collection with 16 features and 2 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 20 ymax: 20
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 16 x 3
##    grp                                                 geometry shape      
##  * <chr>                                             <GEOMETRY> <fct>      
##  1 1                                            POINT (-20 -20) POINT      
##  2 2                                             POINT (-20 10) POINT      
##  3 3                                             POINT (-10 20) POINT      
##  4 4                                            POINT (-10 -10) POINT      
##  5 5                                             POINT (20 -10) POINT      
##  6 6                                              POINT (20 10) POINT      
##  7 7                                              POINT (10 20) POINT      
##  8 8                                             POINT (10 -20) POINT      
##  9 a              MULTIPOINT (-20 -20, -20 10, -10 -10, -10 20) MULTIPOINT 
## 10 b                  MULTIPOINT (10 -20, 10 20, 20 -10, 20 10) MULTIPOINT 
## 11 a              LINESTRING (-20 -20, -20 10, -10 -10, -10 20) LINESTRING 
## 12 b                  LINESTRING (10 -20, 10 20, 20 -10, 20 10) LINESTRING 
## 13 multi MULTILINESTRING ((-20 -20, -20 10, -10 -10, -10 20), ~ MULTILINES~
## 14 a      POLYGON ((-20 -20, -20 10, -10 20, -10 -10, -20 -20)) POLYGON    
## 15 b           POLYGON ((10 -20, 10 20, 20 10, 20 -10, 10 -20)) POLYGON    
## 16 multi MULTIPOLYGON (((-20 -20, -20 10, -10 20, -10 -10, -20~ MULTIPOLYG~

Understanding Real Spatial Data

Show Non-Lesson Code

For this section, we’ll use administrative boundary data describing Afghanistan.

First, here’s a simple function we’ll use to convert the objects to sf data frames carrying the tibble class.

Next, let’s get some data.

We can easily obtain GADM (Global Administrative Areas) data using the {raster} package. The following code accesses the GADM database and retrieves polygons for at the country (level = 0), province (level = 1), and district (level = 2) levels.

Here’s what one of the data sets looks like…

## Simple feature collection with 34 features and 10 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 60.50487 ymin: 29.36157 xmax: 74.89413 ymax: 38.49092
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 34 x 11
##    GID_0 NAME_0 GID_1 NAME_1 VARNAME_1 NL_NAME_1 TYPE_1 ENGTYPE_1 CC_1 
##    <chr> <chr>  <chr> <chr>  <chr>     <chr>     <chr>  <chr>     <chr>
##  1 AFG   Afgha~ AFG.~ Badak~ Badahšan  <NA>      Velay~ Province  <NA> 
##  2 AFG   Afgha~ AFG.~ Badgh~ Badghes|~ <NA>      Velay~ Province  <NA> 
##  3 AFG   Afgha~ AFG.~ Baghl~ Baglan|B~ <NA>      Velay~ Province  <NA> 
##  4 AFG   Afgha~ AFG.~ Balkh  Balh|Maz~ <NA>      Velay~ Province  <NA> 
##  5 AFG   Afgha~ AFG.~ Bamyan <NA>      <NA>      Velay~ Province  <NA> 
##  6 AFG   Afgha~ AFG.~ Dayku~ <NA>      <NA>      Velay~ Province  <NA> 
##  7 AFG   Afgha~ AFG.~ Farah  <NA>      <NA>      Velay~ Province  <NA> 
##  8 AFG   Afgha~ AFG.~ Faryab Fariab|M~ <NA>      Velay~ Province  <NA> 
##  9 AFG   Afgha~ AFG.~ Ghazni Gazni     <NA>      Velay~ Province  <NA> 
## 10 AFG   Afgha~ AFG.~ Ghor   Gawr|Gho~ <NA>      Velay~ Province  <NA> 
## # ... with 24 more rows, and 2 more variables: HASC_1 <chr>,
## #   geometry <POLYGON [°]>

Let’s walk through the information in the header.

  1. Simple feature collection with 34 features and 10 fields
    • The first line clues us into what sf stands for: Simple Features.
      • Simple features refers to a formal standard that describes how spatial geometries can be represented and stored by computers, including databases.
      • The first line also says with 34 features and 10 fields.
        • These numbers indicate the number of rows and number of non-spatial columns.
  2. geometry type: POLYGON
    • This tells us exactly what you think: our data contains spatial polygons.
  3. dimension: XY
    • This tells us that our data have two dimensions, such as "longitude" and "latitude".
  4. bbox: xmin: 60.50487 ymin: 29.36157 xmax: 74.89413 ymax: 38.49092
    • This is our data’s bounding-box: the maximum \(x\) and \(y\) coordinates.
      • If we were to plot these points, the resulting “square” would contain all of our data.
  5. epsg (SRID): 4326
    • This is our data’s EPSG (European Petroleum Survey Group) SRID (Spatial Reference Identifier).
      • epsgs are definitions of coordinate reference systems and transformations developed by EPSG and are widely used.
  6. proj4string: +proj=longlat +datum=WGS84 +no_defs
    • This is how our data are represented by the proj.4 coordinate reference system.
      • In addition to MGRS, EPSG and proj.4 are the systems you’re most likely to encounter.

Coordinate Reference Systems

These data uses an EPSG SRID of 4326. 4326 and 3857 are the EPSGs that you’re most likely to encounter in the wild. So, what’s the difference?

  • 4326 uses a coordinate system that represents data on a curved surface, like a globe.
  • 3857 uses a coordinate system that is projected to two dimensions, like a map.

You can think of 4326 as representing data the like Google Earth, while 3857 represents data like Google Maps.

The rest of the output should seem familiar by now. The second section (starting with # A tibble: 34 x 11) is much the same as the tibble data.frames you have seen so far. This is because sf objects are simply special data frames.

## [1] TRUE

One of the main benefits of using sf objects is precisely that they are still data.frames; we can now apply our previous knowledge of manipulating tabular data to the spatial domain.

Lastly, notice that afghan_1_sf has a column named geometry.

## Simple feature collection with 34 features and 0 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 60.50487 ymin: 29.36157 xmax: 74.89413 ymax: 38.49092
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 34 x 1
##                                                                    geometry
##                                                               <POLYGON [°]>
##  1 ((71.14804 36.00124, 71.14226 35.99334, 71.11366 35.9713, 71.10155 35.9~
##  2 ((63.09734 34.64551, 63.0825 34.65775, 63.07203 34.67375, 63.06917 34.6~
##  3 ((67.35538 34.88549, 67.35532 34.88847, 67.35298 34.88902, 67.34767 34.~
##  4 ((66.42347 35.64057, 66.44878 35.65366, 66.48534 35.66675, 66.49884 35.~
##  5 ((66.65279 34.00322, 66.65859 34.01669, 66.66908 34.03053, 66.67175 34.~
##  6 ((65.32353 33.12133, 65.3424 33.17651, 65.34425 33.17925, 65.35534 33.2~
##  7 ((61.12394 31.45273, 61.00227 31.46777, 60.84569 31.4869, 60.83062 31.5~
##  8 ((65.37444 35.44175, 65.37273 35.39013, 65.374 35.35973, 65.37601 35.35~
##  9 ((67.3762 32.08505, 67.36906 32.09193, 67.35442 32.11126, 67.341 32.123~
## 10 ((64.52828 33.32642, 64.5059 33.33145, 64.49678 33.3273, 64.49186 33.32~
## # ... with 24 more rows

This is how sf stores spatial geometry data. The <POLYGON [°]> type label below the column names tells us that R is indeed tracking these data as individual polygons with vertices stored as decimal degrees, which we use to represent points on curved surfaces.

Now that we understand what our data tell us, let’s build some maps.

Static Maps

We’re going to focus on using the ggplot2 package to visualize our spatial data and reinforce some of the patterns you’ve seen in previous lessons.

In addition to functions you saw in the Visualizing Data lesson, such as geom_line() and geom_point(), we can also easily plot our spatial data using geom_sf(), like so…

If we don’t want to fill our polygons with any color, we can specify fill=NA.

We can also specify aes()thetics based on columns in our sf objects just like regular data.frames.

If our data has a column of labels that we want to display, we can do so using geom_sf_test().

Since spatial data exist somewhere, it’s usually a good idea to provide some context telling your audience where your data are.

With that in mind, we should also include the countries neighboring Afghanistan, which are these…

We can expand on the technique we used to get Afghanistan data and get data for all countries in one swoop.

To do so, we can use purrr::map(), to iterate over the country names and combine them into a list of sf data frames, like so…

We can then bind the data frames by row combining do.call() and rbind(), like so…

We can crop those countries down to an area that surrounds Afghanistan using st_crop(). Since we want enough space to include country name labels, we’ll expand Afghanistan’s bounding box slightly using st_buffer().

## Simple feature collection with 7 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 59.75493 ymin: 28.61157 xmax: 75.64412 ymax: 39.24092
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   GID_0       NAME_0                       geometry
## 1   CHN        China POLYGON ((75.64412 36.76882...
## 2   IND        India POLYGON ((72.28305 28.61157...
## 3   IRN         Iran POLYGON ((59.75493 37.10242...
## 4   PAK     Pakistan POLYGON ((61.85089 28.61157...
## 5   TJK   Tajikistan POLYGON ((74.89049 37.23569...
## 6   TKM Turkmenistan POLYGON ((62.81617 35.30153...
## 7   UZB   Uzbekistan POLYGON ((67.79063 37.18518...

Finally, let’s put all the pieces together and build our finished map.

Interactive Maps

Building complex maps using static plots is a powerful method that allows us to customize the information we wish to communicate in finished products. However, we often need the ability to rapidly explore our data in an interactive fashion.

A popular tool for accomplishing this task is leaflet, which is a JavaScript library that builds interactive maps and renders them in internet browsers.

We can harness leaflet’s capabilities from R using the {leaflet} package, which can convert our R data and render maps without the user having any knowledge of JavaScript.

Before we walk through the basics of {leaflet}, let’s create some practice data.

We can create a variable named kabul_sf, which is an sf data frame containing a polygon roughly encircling Kabul, Afghanistan’s coordinates.

## Simple feature collection with 1 feature and 0 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 69.02034 ymin: 34.3518 xmax: 69.37966 ymax: 34.64793
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 1 x 1
##                                                                    geometry
##                                                               <POLYGON [°]>
## 1 ((69.37966 34.5, 69.37942 34.49225, 69.37868 34.48452, 69.37745 34.47683~

We start by calling the leaflet::leaflet() function. We then pipe (%>%) it to various leaflet::add*() functions, which add additional layers and tools to our map.

Here, we’ll call leaflet() then pipe it to addTiles().


While we can use leaflet::addTiles() to add a base map, we’re going to use leaflet::addProviderTiles() to customize our choice of base map.

When we load {leaflet} using library(leaflet), as we did at the beginning of this lesson, the variable providers is imported into our environment.

providers is a named list containing a set of web map tiles from which we can choose a base map.

If we want to use some satellite imagery as our base map, we can provide providers$Esri.WorldImagery as to leaflet::addProviderTiles()’s provider= argument, like so…


Now we can start adding our data to the map. Since kabul_sf is POLYGON data, we can use leaflet::addPolygons() to add it to our map like so…


Notice that our map automatically zoomed to our data.

In addition to POLYGON data, we can add other geometries using functions like leaflet::addMarkers() for POINT and data leaflet::addPolylines() for LINESTRING data.

While satellite imagery can offer powerful capabilities on its own, it’s best to include information that provides some spatial context.

Let’s also add some labels with another call to leaflet::addProviderTiles(), this time selecting providers$Hydda.RoadsAndLabels. We’ll also disable the internal color of our polygon using fill=NA.


Zoom in and out of the result and notice that the level of detail in our base map changes depending on the zoom level.

Now that we have the fundamentals of {leaflet} in place, let’s walk through a case study.

Case Study: Kabul IEDs

SIGACTS

We’re going to use publicly available data derived from SIGACTS reports.

The following function will automatically download and read the data into R…

After defining the function by running the above code, you can use it like so…

After that, we can take an initial glimpse() of sigacts_df to get an idea of what the data include.

## Observations: 431,547
## Variables: 9
## $ Field1               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ MGRS                 <chr> "41SNV7666", "41RQR7333", "42SVD9555", "4...
## $ PrimaryEventType     <chr> "Enemy Action", "Enemy Action", "Enemy Ac...
## $ PrimaryEventCategroy <chr> "Indirect Fire", "Direct Fire", "Indirect...
## $ Title                <chr> "(ENEMY ACTION) INDIRECT FIRE RPT Ghormac...
## $ DateOccurred         <dttm> 2008-01-01 00:00:00, 2008-01-01 00:45:00...
## $ export               <chr> "Sun Feb 12 22:59:47 2017", "Sun Feb 12 2...
## $ DDLat                <chr> "35.83523N", "31.90011N", "34.83728N", "3...
## $ DDLon                <chr> "063.84149E", "065.88659E", "068.94531E",...

MGRS

While the data contain longitude and latitude coordinates (DDLon and DDLat), we also have the original MGRS grid from which the coordinate pairs were derived. To make our work realistic, and also more accurately represent what the data actually tell us, we’ll use the MGRS grids instead.

Cleaning and Validating MGRS Strings

  1. We match the beginning of a string using "^", like so…


UTM Zone

Courtesy of NGA

Courtesy of NGA

  1. The first 2 characters are digits indicating a UTM Zone, which represents a 6°-wide slice of the globe.
    • RegEx Pattern: \\d{1,2}
      • \\d matches any digit (0-9)
        • \\d{1,2} matches any digit with a minimum length of 1 and consecutive matches up to a maximum length of 2


Since we’re walking through the individual pieces, we’ll iteratively build mgrs_regex, which we’ll use to validate all of our data at the end.

Here’s our current mgrs_regex.



Grid Zone Designator (GZD)

  1. The third character is the latitude band, specifying the Northern and Southern boundaries of the UTM Zone.
    • RegEx Pattern: [CDEFGHJKLMNPQRSTUVWXY]
      • can be any upper-case letter between C and X, except for I and O.
        • you can think of patterns inside [] as being one of these characters

Combining the UTM zone and latitude band gives us the Grid Zone Designator (GZD).

Here’s our updated mgrs_regex.



Next Letters

Courtesy of NGA

Courtesy of NGA

  1. The fourth and fifth characters are 2 upper-case letters identifying a 100,000 meter grid square inside the GZD.
    • RegEx Pattern: [A-Z]{1,2}
      • [A-Z] matches any upper-case letter
        • [A-Z]{1,2} matches any upper-case letter with a minimum length of 1 and consecutive matches up to a maximum length of 2

Here’s our updated mgrs_regex.



Easting and Northing
Courtesy of NGA

Courtesy of NGA

  1. The remaining characters are an even number of digits specifying Easting and Northing values inside the UPS.
    • RegEx Pattern: (\\d{2})+
      • \\d: any digit
      • \\d{2}: a sequence of 2 consecutive digits
      • (\\d{2})+: 1 or more consecutive sequences of 2 consecutive digits

Here’s our updated mgrs_regex.



  1. We then match the end of a string using "$".

Here’s the final version of mgrs_regex.



Now that we know how to match a valid MGRS coordinate, we can write a reusable function to validate our data.

If it finds the pattern, the function returns TRUE. Otherwise, it returns FALSE.

## # A tibble: 6 x 2
##   mgrs         is_valid
##   <chr>        <lgl>   
## 1 42SXF1839    TRUE    
## 2 41RQQ6992    TRUE    
## 3 41RPP0369    TRUE    
## 4 Unknown      FALSE   
## 5 42 S XF 1839 FALSE   
## 6 39EXF83914   FALSE

Using is_valid_mgrs(), let’s see if there any problems in init_sigacts_df

## # A tibble: 4 x 1
##   MGRS      
##   <chr>     
## 1 Unknown   
## 2 Unknown   
## 3 41SPR97456
## 4 42SWC 7 4

We have 4 strings that aren’t valid.

There’s nothing we can do for Unknown, so we’ll simply drop those rows

## # A tibble: 2 x 1
##   MGRS      
##   <chr>     
## 1 41SPR97456
## 2 42SWC 7 4

The problem with 42SWC 7 4 is that there is invalid whitespace. We can fix this by using stringr::str_remove_all() to remove all white space (pattern="\\s+").

## # A tibble: 1 x 1
##   MGRS      
##   <chr>     
## 1 41SPR97456

Lastly, 41SPR97456 is invalid because the Easting-Northing portion inside the UPS (97456) has an odd number of digits.

We can test for this in a simple function we’ll call is_odd_length_easting_northing() using the following steps…

  1. Using stringr::str_extract(), extract the Easting and Northing digits by matching consecutive digits followed by the end of the string (pattern = "\\d+$").
  2. Using stringr::str_count(), count the number of digits (pattern = "\\d").
  3. Using %% (pronounced “modulo”), get the remainder of dividing the digit count by 2
  4. Conduct an inequality test on the remainder.
    • if the remainder is 0, the count is an even number
      • the function then returns FALSE
    • if the remainder is not 0, the count is an odd number
      • the function then returns TRUE

First, we str_extract() the Easting and Northing digits using pattern="\\d+$.

## [1] TRUE

Fortunately, we only need to truncate the abbreviated MGRS coordinates; we just need to remove the last digit (6) from 41SPR97456.

We can do this using a stringr::str_remove(). The pattern= argument is more complicated here, but all it does is find a digit (\\d) that is followed by the end of the string ($).

Let’s write a function to do this for us, which we’ll call truncate_mgrs(), which looks like so…

## [1] "41SPR9745"

Finally, let’s put all the steps together and create a new variable called sigacts_df that only contains valid MGRS coordinates.

Then we can check sigacts_df to see if they’re all valid.

## [1] TRUE

Merging Data by MGRS

While converting MGRS data to any other system is possible in R, we’ll use public NGA data in order enforce correctness.

The following function downloads NGA’s 1 km MGRS grids for the desired GZD as shapefiles. It then performs an inner-join on a provided sf object, only returning grids present in both.

There’s a lot going on here that’s outside the scope of this lesson, but the main idea is that we can automate a the entire process of retrieving and prepping large amounts of data with a function that is only a maximum of 16 lines.

Using our kabul_sf polygon to only keep the desired area, we can run the function like so…

We can now inspect kabul_mgrs using the {leaflet} package.


Now that we have kabul_mgrs ready to go, we’ll join it with sigacts_df using an inner_join()

Joining Tables

An inner-join is used to combine tables (data.frames in R) so that only rows containing values matched between both the left-hand side and right-hand side are kept, with all columns from both sides added.

Basically, we use inner_join() to combine and filter data simultaneously.

Let’s do a quick example. Here are two data.frames.

When inner_join() these data.frames, only rows where the key column matches are returned

## # A tibble: 2 x 3
##   key   val_lhs val_rhs
##   <chr>   <dbl>   <dbl>
## 1 a           1       3
## 2 b           2       4

By default, inner_join() will match all the columns in both data.frames that have the same name, but in practice we should always eliminate ambiguity (and thus unsafety) by specifying which columns to match by providing an argument to inner_join()’s by= parameter.

## # A tibble: 2 x 3
##   key   val_lhs val_rhs
##   <chr>   <dbl>   <dbl>
## 1 a           1       3
## 2 b           2       4

If the columns names don’t match, we again provide an argument to by=, but we clarify which columns to match using a named vector.

## # A tibble: 2 x 3
##   key_lhs val_lhs val_rhs
##   <chr>     <dbl>   <dbl>
## 1 a             1       3
## 2 b             2       4

If you hadn’t guessed, inner_join() is just one kind of table join. We won’t discuss them in this lesson, but you can also use left_join(), right_join(), outer_join(), anti_join(), and more.

With that in mind, when we inner_join() sigacts_df with kabul_mgrs, only rows with matching MGRS columns will be returned. Since the result will now have the geometry from kabul_mgrs, we’ll also convert it using st_as_sf(). Finally, we’ll rename columns to be a little more explicit.

## Simple feature collection with 13374 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 69.01089 ymin: 34.35007 xmax: 69.38147 ymax: 34.6568
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 13,374 x 5
##    date_time           event_cat event_descripti~ MGRS 
##    <dttm>              <chr>     <chr>            <chr>
##  1 2008-01-01 19:30:00 Reported~ (COUNTER-INSURG~ 42SW~
##  2 2008-01-02 04:30:00 Meeting ~ 02 0430Z TF GLA~ 42SW~
##  3 2008-01-02 08:57:00 Other     01 January 08 N~ 42SW~
##  4 2008-01-02 09:02:00 Other     02 January 08 N~ 42SW~
##  5 2008-01-02 18:12:00 Indirect~ (ENEMY ACTION) ~ 42SW~
##  6 2008-01-03 04:00:00 Meeting ~ 030400Z Bagram ~ 42SW~
##  7 2008-01-03 05:02:00 Other     03 January 08 N~ 42SW~
##  8 2008-01-03 10:45:00 IED False (EXPLOSIVE HAZA~ 42SW~
##  9 2008-01-03 19:30:00 Attack T~ (THREAT REPORT)~ 42SW~
## 10 2008-01-04 04:53:00 Other     04 January 08 N~ 42SW~
## # ... with 13,364 more rows, and 1 more variable: geometry <POLYGON [°]>

Next, we’ll set up a base_map that includes both the sattelite imagery and road labels.

There are multiple IED-related categories in the data, but we’re only interested in those that indicate that an IED was definitely present. We’ll create a variable named ied_cats that we’ll use to filter kabul_ieds_sf.

## Simple feature collection with 398 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 69.0218 ymin: 34.35009 xmax: 69.38143 ymax: 34.64784
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 398 x 5
##    date_time           event_cat event_descripti~ MGRS 
##  * <dttm>              <chr>     <chr>            <chr>
##  1 2008-01-31 07:53:00 IED Expl~ (EXPLOSIVE HAZA~ 42SW~
##  2 2008-03-17 03:56:00 IED Foun~ (EXPLOSIVE HAZA~ 42SW~
##  3 2008-03-24 14:44:00 IED Expl~ (EXPLOSIVE HAZA~ 42SW~
##  4 2008-04-28 12:45:00 IED Foun~ (EXPLOSIVE HAZA~ 42SW~
##  5 2008-04-28 17:15:00 IED Foun~ (EXPLOSIVE HAZA~ 42SW~
##  6 2008-04-29 08:30:00 IED Foun~ (EXPLOSIVE HAZA~ 42SW~
##  7 2008-05-25 02:15:00 IED Expl~ (EXPLOSIVE HAZA~ 42SW~
##  8 2008-05-25 06:45:00 IED Expl~ (EXPLOSIVE HAZA~ 42SW~
##  9 2008-07-11 09:38:00 IED Foun~ (EXPLOSIVE HAZA~ 42SW~
## 10 2008-07-12 09:38:00 IED Foun~ (EXPLOSIVE HAZA~ 42SW~
## # ... with 388 more rows, and 1 more variable: geometry <POLYGON [°]>

Now let’s count() how many IEDs have been present per MGRS grid.

## Simple feature collection with 209 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 69.0218 ymin: 34.35009 xmax: 69.38143 ymax: 34.64784
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 209 x 3
##    MGRS         n                                                  geometry
##  * <chr>    <int>                                             <POLYGON [°]>
##  1 42SWD02~     2 ((69.03269 34.5397, 69.0218 34.5397, 69.0218 34.54872, 6~
##  2 42SWD03~     2 ((69.04358 34.51264, 69.03268 34.51265, 69.03269 34.5216~
##  3 42SWD03~     1 ((69.0436 34.54872, 69.0327 34.54872, 69.0327 34.55774, ~
##  4 42SWD03~     1 ((69.0436 34.55773, 69.0327 34.55774, 69.0327 34.56675, ~
##  5 42SWD04~     1 ((69.05447 34.51264, 69.04358 34.51264, 69.04358 34.5216~
##  6 42SWD04~     2 ((69.05448 34.52166, 69.04358 34.52166, 69.04359 34.5306~
##  7 42SWD04~     2 ((69.05449 34.53969, 69.04359 34.5397, 69.0436 34.54872,~
##  8 42SWD05~     1 ((69.06528 34.3954, 69.0544 34.3954, 69.0544 34.40442, 6~
##  9 42SWD05~     4 ((69.06537 34.52165, 69.05448 34.52166, 69.05448 34.5306~
## 10 42SWD05~     1 ((69.0654 34.55772, 69.0545 34.55773, 69.05451 34.56675,~
## # ... with 199 more rows

Before visualizing the results with leaflet, let’s create variable called ied_palthat generates a gradient using a yellow-orange-red palette. We’ll color each grid using ied_pal.

We’ll use leaflet::colorNumeric() to create our palette, like so…

Next, let’s combine our base_map with the polygon data in kabul_ied_counts_sf and a legend.


Data Enrichment with Open Street Map

Using the Open Street Map API, let’s add some more context to our data.

First, here’s get_osm_data(), a convenience wrapper to call the API.

For our purposes, let’s just look up a subset of different building types. We’ll assign them to a character vector variable named target_buildings.

Using kabul_sf and target_buildings, we can call our get_osm_data() function and retrieve data like so…

{osmdata} returns a list of sf data.frames with multiple geometries, but we’re only going to keep "osm_polygons" and "osm_multipolygons".

## Simple feature collection with 62 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 69.06514 ymin: 34.4484 xmax: 69.3591 ymax: 34.59946
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##      building                  geometry
## 1      mosque POINT (69.18226 34.53551)
## 2  university POINT (69.12147 34.47943)
## 3  university  POINT (69.12239 34.4797)
## 4  university POINT (69.12285 34.47918)
## 5  university  POINT (69.12347 34.4795)
## 6  university POINT (69.12309 34.47995)
## 7      school POINT (69.07718 34.49639)
## 8      mosque  POINT (69.16859 34.5361)
## 9      public POINT (69.17374 34.52116)
## 10      hotel POINT (69.16368 34.53144)





sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
##  [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0  
##  [9] tidyverse_1.2.1 stars_0.3-1     abind_1.4-5     sf_0.7-6       
## [13] leaflet_2.0.2  
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-140       lubridate_1.7.4    RColorBrewer_1.1-2
##  [4] webshot_0.5.1      httr_1.4.0         rprojroot_1.3-2   
##  [7] tools_3.6.1        backports_1.1.4    utf8_1.1.4        
## [10] rgdal_1.4-4        R6_2.4.0           KernSmooth_2.23-15
## [13] rgeos_0.4-3        DBI_1.0.0          lazyeval_0.2.2    
## [16] colorspace_1.4-1   raster_2.9-23      withr_2.1.2       
## [19] sp_1.3-1           tidyselect_0.2.5   curl_4.0          
## [22] compiler_3.6.1     cli_1.1.0.9000     rvest_0.3.4       
## [25] xml2_1.2.1         labeling_0.3       scales_1.0.0      
## [28] classInt_0.3-3     digest_0.6.20      rmarkdown_1.14    
## [31] jpeg_0.1-8         pkgconfig_2.0.2    htmltools_0.3.6   
## [34] highr_0.8          htmlwidgets_1.3    rlang_0.4.0       
## [37] readxl_1.3.1       rstudioapi_0.10    shiny_1.3.2       
## [40] generics_0.0.2     jsonlite_1.6       crosstalk_1.0.0   
## [43] magrittr_1.5       kableExtra_1.1.0   Rcpp_1.0.2        
## [46] munsell_0.5.0      fansi_0.4.0        stringi_1.4.3     
## [49] yaml_2.2.0         grid_3.6.1         parallel_3.6.1    
## [52] promises_1.0.1     crayon_1.3.4       lattice_0.20-38   
## [55] haven_2.1.1        hms_0.5.0          zeallot_0.1.0     
## [58] knitr_1.23         pillar_1.4.2       codetools_0.2-16  
## [61] glue_1.3.1         evaluate_0.14      modelr_0.1.4      
## [64] vctrs_0.2.0        httpuv_1.5.1       cellranger_1.1.0  
## [67] gtable_0.3.0       assertthat_0.2.1   xfun_0.8          
## [70] mime_0.7           xtable_1.8-4       broom_0.5.2       
## [73] e1071_1.7-2        later_0.8.0        class_7.3-15      
## [76] viridisLite_0.3.0  units_0.6-3        osmdata_0.1.1.004 
## [79] here_0.1