A Data Wrangling in R

In this part of the toolkit, we’ll learn how to do the same things we did in Chapter 4 - Spatial Data Wrangling, but this time using R code to handle our spatial data.

Getting Started

R is a great choice for getting started in data science because it was built for it. It’s not just a programming language; it’s a whole ecosystem of tools and libraries designed to help you think and work like a data scientist.

We assume a basic knowledge of R and coding for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages and know how to find the path to a folder on your computer.

Tutorials for R

If you are new to R, we recommend the following intro-level tutorials provided through installation guides. You can also refer to this R for Social Scientists tutorial developed by Data Carpentry for a refresher.

You can also visit the RStudio Education page to select a learning path tailored to your experience level (Beginners, Intermediates, Experts). They offer detailed instructions to learners at different stages of their R journey.

A.1 Environmental Setup

Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.

Download the Activity Datasets

Please download and unzip this file to get started: SDOHPlace-DataWrangling.zip

Setting Up the Working Directory

Setting up a working directory in R is crucial as it defines the location on your computer where your files and scripts will be saved and accessed. You can set the working directory to any folder on your system where you plan to store your datasets and R scripts. To set your working directory, use the setwd() function with the path to your desired folder, for example setwd("/path/to/your/directory").
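To check that the working directory is set correctly, you can pair setwd() with getwd(). A minimal sketch (tempdir() is used here only so the example runs anywhere; substitute the path to your own project folder):

```r
# Illustrative only: set the working directory and confirm it took effect.
# Replace tempdir() with your own folder path, e.g. "/Users/me/SDOH"
# (a hypothetical path).
setwd(tempdir())
getwd()  # prints the current working directory
```

If the path doesn’t exist, setwd() stops with an error, which is an easy first thing to check when file reads fail later on.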

Installing & Working with R Libraries

Before starting operations related to spatial data, we need to complete an environmental setup. This workshop requires several packages, which can be installed from CRAN:

  • sf: simplifies spatial data manipulation
  • tmap: streamlines thematic map creation
  • dplyr: facilitates data manipulation
  • tidygeocoder: converts addresses to coordinates easily

Uncomment the code snippet below to install the packages. You only need to install a package once in a given R environment.

#install.packages(c("sf", "tmap", "tidygeocoder", "dplyr"))

Installation Tip

For Mac users, check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the sf package. Using Homebrew to install gdal usually fixes any remaining issues.

Now, load the required libraries for the next steps:

library(sf)
library(dplyr)
library(tmap)

A.2 Intro to Spatial Data

Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.

Load Spatial Data

We need to load the spatial data (a shapefile). Remember, this type of data is actually composed of multiple files; all of them must be present in the same folder to read correctly. Let’s use chicagotracts.shp for practice, which includes the census tract boundaries in Chicago.

First, read the shapefile from wherever you saved it.

Chi_tracts = st_read("SDOHPlace-DataWrangling/chicagotracts.shp")

Your output will look something like:

## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS:  WGS 84

Always inspect data when loading in. Let’s look at a non-spatial view.

head(Chi_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.68822 ymin: 41.72902 xmax: -87.62394 ymax: 41.87455
## Geodetic CRS:  WGS 84
##   commarea commarea_n countyfp10     geoid10 name10        namelsad10 notes statefp10 tractce10                       geometry
## 1       44         44        031 17031842400   8424 Census Tract 8424  <NA>        17    842400 POLYGON ((-87.62405 41.7302...
## 2       59         59        031 17031840300   8403 Census Tract 8403  <NA>        17    840300 POLYGON ((-87.68608 41.8229...
## 3       34         34        031 17031841100   8411 Census Tract 8411  <NA>        17    841100 POLYGON ((-87.62935 41.8528...
## 4       31         31        031 17031841200   8412 Census Tract 8412  <NA>        17    841200 POLYGON ((-87.68813 41.8556...
## 5       32         32        031 17031839000   8390 Census Tract 8390  <NA>        17    839000 POLYGON ((-87.63312 41.8744...
## 6       28         28        031 17031838200   8382 Census Tract 8382  <NA>        17    838200 POLYGON ((-87.66782 41.8741...

Check out the data structure of this file.

str(Chi_tracts)
## Classes 'sf' and 'data.frame':   801 obs. of  10 variables:
##  $ commarea  : chr  "44" "59" "34" "31" ...
##  $ commarea_n: num  44 59 34 31 32 28 65 53 76 77 ...
##  $ countyfp10: chr  "031" "031" "031" "031" ...
##  $ geoid10   : chr  "17031842400" "17031840300" "17031841100" "17031841200" ...
##  $ name10    : chr  "8424" "8403" "8411" "8412" ...
##  $ namelsad10: chr  "Census Tract 8424" "Census Tract 8403" "Census Tract 8411" "Census Tract 8412" ...
##  $ notes     : chr  NA NA NA NA ...
##  $ statefp10 : chr  "17" "17" "17" "17" ...
##  $ tractce10 : chr  "842400" "840300" "841100" "841200" ...
##  $ geometry  :sfc_POLYGON of length 801; first list element: List of 1
##   ..$ : num [1:243, 1:2] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:9] "commarea" "commarea_n" "countyfp10" "geoid10" ...

The data is no longer a shapefile but an sf object, made up of polygons. The plot() command in R helps to quickly visualize the geometric shapes of Chicago’s census tracts. The output includes multiple maps because the sf framework previews each attribute in our spatial file.

plot(Chi_tracts)

A.2.1 Adding a Basemap

Then, we can use tmap, a mapping library, in interactive mode to add a basemap layer. It plots the spatial data from Chi_tracts, applies a minimal theme for clarity, and labels the map with a title, offering a straightforward visualization of the area’s census tracts.

We stylize the borders of the tract boundaries by making them 50% transparent (an alpha level of 0.5).

library(tmap)

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(Chi_tracts) + tm_borders(alpha=0.5) +
  tm_layout(title = "Census Tract Map of Chicago")

Still in interactive (“view”) mode, we can switch to a different basemap. Here we bring in a “Voyager” style map from Carto, a cartographic firm. We’ll make the borders less transparent by raising the alpha level.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_basemap("CartoDB.Voyager") + 
  tm_shape(Chi_tracts) + tm_borders(alpha=0.8, col = "gray40") +
  tm_layout(title = "Census Tract Map of Chicago")

Tip

For additional options, you can preview basemaps at the Leaflet Providers Demo. Some basemap providers we recommend that work consistently are:

  • CartoDB (Carto)
  • Open Street Map
  • ESRI

Not all basemaps are available anymore, and some require keys that you’d need to add on your own.

A.3 Coordinate Reference Systems

For this exercise we will use chicagotracts.shp to explore how to change the projection of a spatial dataset in R. First, let’s check out the current coordinate reference system.

st_crs(Chi_tracts)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

We can use the st_transform function to transform the CRS. When projecting a dataset covering all of Illinois, an appropriate NAD83 projection would be NAD83 UTM zone 16N; Chicago itself sits within the area best covered by NAD83 / Illinois East (ftUS) (EPSG:3435).

Chi_tracts.3435 <- st_transform(Chi_tracts, "EPSG:3435")
# Chi_tracts.3435 <- st_transform(Chi_tracts, 3435)
st_crs(Chi_tracts.3435)
## Coordinate Reference System:
##   User input: EPSG:3435 
##   wkt:
## PROJCRS["NAD83 / Illinois East (ftUS)",
##     BASEGEOGCRS["NAD83",
##         DATUM["North American Datum 1983",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4269]],
##     CONVERSION["SPCS83 Illinois East zone (US survey foot)",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",36.6666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-88.3333333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.999975,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",984250,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["US survey foot",0.304800609601219],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",east,
##             ORDER[1],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##         AXIS["northing (Y)",north,
##             ORDER[2],
##             LENGTHUNIT["US survey foot",0.304800609601219]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["United States (USA) - Illinois - counties of Boone; Champaign; Clark; Clay; Coles; Cook; Crawford; Cumberland; De Kalb; De Witt; Douglas; Du Page; Edgar; Edwards; Effingham; Fayette; Ford; Franklin; Gallatin; Grundy; Hamilton; Hardin; Iroquois; Jasper; Jefferson; Johnson; Kane; Kankakee; Kendall; La Salle; Lake; Lawrence; Livingston; Macon; Marion; Massac; McHenry; McLean; Moultrie; Piatt; Pope; Richland; Saline; Shelby; Vermilion; Wabash; Wayne; White; Will; Williamson."],
##         BBOX[37.06,-89.27,42.5,-87.02]],
##     ID["EPSG",3435]]

After changing the projection, we can plot the map. We’ll switch to the static version of tmap, using the plot mode.

tmap_mode("plot")
## tmap mode set to plotting
tm_shape(Chi_tracts.3435) + tm_borders(alpha=0.5) + 
  tm_layout(main.title ="EPSG:3435 (ft)",
             main.title.position = "center")

What if we had used the wrong EPSG code, referencing the wrong projection? Here we’ll transform and plot with EPSG:3561, a coordinate reference system for Hawai‘i.

Chi_tracts.3561 <- st_transform(Chi_tracts, "EPSG:3561")


tm_shape(Chi_tracts.3561) + tm_borders(alpha=0.5) + 
  tm_layout(main.title ="EPSG:3561 (ft)",
             main.title.position = "center")

It’s obviously not correct – the wrong CRS can cause a load of trouble. Make sure you specify carefully!

Refine Basic Map

Let’s take a deeper look at the cartographic mapping package, tmap. We approach mapping one layer at a time. Always start with the object you want to map by calling it with the tm_shape function; then at least one descriptive/styling function follows. There are hundreds of variations and parameter specifications.

Here we style the tracts with some semi-transparent borders.

library(tmap)

tm_shape(Chi_tracts) + tm_borders(alpha=0.5) 

Next we fill the tracts with a light gray, and adjust the color and transparency of borders. We also add a scale bar, positioning it to the left and having a thickness of 0.8 units, and turn off the frame.

tm_shape(Chi_tracts) + tm_fill(col = "gray90") + tm_borders(alpha=0.2, col = "gray10") +
  tm_scale_bar(position = ("left"), lwd = 0.8) +
  tm_layout(frame = F)

Check out https://rdrr.io/cran/tmap/man/tm_polygons.html for more ideas.

A.4 Converting to Spatial Data

A.4.1 Convert CSVs to Spatial Data

We’ll use Affordable_Rental_Housing_Developments.csv from the dataset to show how to convert CSV latitude/longitude data to points. First, we need to load the CSV data.

housing = read.csv("SDOHPlace-DataWrangling/Affordable_Rental_Housing_Developments.csv")

Then, we remove rows with missing values, ensuring that the columns intended to be used as coordinates contain no NA values.

cleaned_housing <- na.omit(housing)

Inspect the data to confirm it’s doing what you expect it to be doing. What columns will you use to specify the coordinates? In this dataset, we have multiple coordinate options. We’ll use latitude and longitude, or rather, longitude as our X value, and latitude as our Y value. In the data, it’s specified as “Longitude” and “Latitude.”

head(cleaned_housing)
##   Community.Area.Name Community.Area.Number      Property.Type                   Property.Name                 Address Zip.Code Phone.Number               Management.Company Units X.Coordinate
## 2         Rogers Park                     1             Senior              Morse Senior Apts.      6928 N. Wayne Ave.    60626 312-602-6207                 Morse Urban Dev.    44      1165844
## 3              Uptown                     3                ARO                      The Draper        5050 N. Broadway    60640 312-818-1722                        Flats LLC    35      1167357
## 4           Edgewater                    77             Senior                   Pomeroy Apts.    5650 N. Kenmore Ave.    60660 773-275-7820                  Habitat Company   198      1168181
## 5            Roseland                    49 Supportive Housing               Wentworth Commons 11045 S. Wentworth Ave.    60628 773-568-7804          Mercy Housing Lakefront    50      1176951
## 6       Humboldt Park                    23        Multifamily            Nelson Mandela Apts.      607 N. Sawyer Ave.    60624 773-227-6332                 Bickerdike Apts.     6      1154640
## 7     Grand Boulevard                    38        Multifamily Legends South - Gwendolyn Place   4333 S. Michigan Ave.    60653 773-624-7676 Interstate Realty Management Co.    71      1177924
##   Y.Coordinate Latitude Longitude                              Location
## 2      1946059 42.00757 -87.66517 (42.0075737709331, -87.6651711448293)
## 3      1933882 41.97413 -87.65996 (41.9741295261027, -87.6599553011627)
## 4      1937918 41.98519 -87.65681  (41.9851867755403, -87.656808676983)
## 5      1831516 41.69302 -87.62777 (41.6930159120977, -87.6277673462214)
## 6      1903912 41.89215 -87.70753 (41.8921534052465, -87.7075265659001)
## 7      1876178 41.81555 -87.62286  (41.815550396096, -87.6228565224104)

Finally, we convert the data to points. Be sure to use the CRS in which the original coordinates were recorded. In this case we weren’t sure what CRS that was, so we use EPSG:4326 to test.

points_housing <- st_as_sf(cleaned_housing, coords = c("Longitude", "Latitude"), crs = 4326)

View the resulting sf object with a basemap to confirm they are in the right place. Overlay them on top of the tract data, to confirm they are plotting correctly.

### First Layer
tm_shape(Chi_tracts) + tm_borders(lwd = 0.5) + 
  
  ### Second Layer
  tm_shape(points_housing) + tm_dots(size = 0.1 )

You can change the tmap_mode to “view” to add a basemap in an interactive setting, and then switch back to “plot” when complete. Because we’re plotting dots using tmap, we’ll use the tm_dots function for styling.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(Chi_tracts) + tm_borders(lwd = 0.5) + 
  tm_shape(points_housing) + tm_dots(size = 0.01)
tmap_mode("plot")
## tmap mode set to plotting

We’ll reproject to EPSG:3435, our system standard.

housing.3435 <- st_transform(points_housing, "EPSG:3435")

A.4.1.1 Write Spatial Data

Finally, we can save our points as a spatial dataset. Use st_write to write your spatial object in R to a data format of your choice. Here, we’ll write to a GeoJSON file.

Uncomment to run this line.

#st_write(housing.3435, "housing.geojson", driver = "GeoJSON")

You could also save as “housing.shp” to get a shapefile format; however, you’ll get a warning noting that some column names are too long and must be shortened, since the shapefile format limits field names to 10 characters.

#st_write(housing.3435, "housing.shp", driver = "ESRI Shapefile")

The file may still write, but the column names that were too long may be shortened automatically.

To change column or field names in R objects, there are dozens of options. Try different search terms to identify solutions on your own.
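As one hedged sketch of that renaming step, here is base R’s names() applied to a small, hypothetical data frame (dplyr::rename() is another common option); the column names are illustrative, not taken from the housing data:

```r
# Shorten field names to <= 10 characters before writing to shapefile.
# The data frame and column names here are hypothetical.
df <- data.frame(Community.Area.Name = "Rogers Park",
                 Management.Company  = "Morse Urban Dev.")

# Base R: rename columns in place
names(df)[names(df) == "Community.Area.Name"] <- "CommArea"
names(df)[names(df) == "Management.Company"]  <- "MgmtCo"

names(df)  # "CommArea" "MgmtCo"
```

After renaming, st_write() to a shapefile should no longer need to truncate the fields.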

A.4.2 Geocode Addresses

Here, we will use chicago_methadone_nogeometry.csv for practice, which includes methadone centers in Chicago (center names and addresses). First, we load tidygeocoder to handle the geocoding.

library(tidygeocoder)

Let’s read in and inspect the data for methadone maintenance providers. Note, these addresses were made available by SAMHSA and are known, publicly available information. An additional analysis could call each service to check on access to medication during COVID in September 2020, and the list would be updated further.

methadoneClinics <- read.csv("SDOHPlace-DataWrangling/chicago_methadone_nogeometry.csv")

head(methadoneClinics)
##   X                                                         Name                 Address    City State   Zip
## 1 1                Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago    IL 60640
## 2 2                      Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago    IL 60640
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview    3934 N. Lincoln Ave. Chicago    IL 60613
## 4 4                                        PDSSC - Chicago, Inc.     2260 N. Elston Ave. Chicago    IL 60614
## 5 5                          Center for Addictive Problems, Inc.        609 N. Wells St. Chicago    IL 60654
## 6 6                                Family Guidance Centers, Inc.     310 W. Chicago Ave. Chicago    IL 60654

Let’s geocode one address first, just to make sure our system is working. We’ll use the “cascade” method, which uses the US Census and OpenStreetMap geocoders. These two services are the main options with tidygeocoder.

sample <- geo("2260 N. Elston Ave. Chicago, IL", lat = latitude, long = longitude, method = 'cascade')
## Passing 1 address to the US Census single address geocoder
## Query completed in: 0.9 seconds
head(sample)
## # A tibble: 1 × 4
##   address                         latitude longitude geo_method
##   <chr>                              <dbl>     <dbl> <chr>     
## 1 2260 N. Elston Ave. Chicago, IL     41.9     -87.7 census

As we prepare for geocoding, check out the structure of the dataset. The address fields should be character type to be read properly.

str(methadoneClinics)
## 'data.frame':    27 obs. of  6 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name   : chr  "Chicago Treatment and Counseling Center, Inc." "Sundace Methadone Treatment Center, LLC" "Soft Landing Interventions/DBA Symetria Recovery of Lakeview" "PDSSC - Chicago, Inc." ...
##  $ Address: chr  "4453 North Broadway st." "4545 North Broadway St." "3934 N. Lincoln Ave." "2260 N. Elston Ave." ...
##  $ City   : chr  "Chicago" "Chicago" "Chicago" "Chicago" ...
##  $ State  : chr  "IL" "IL" "IL" "IL" ...
##  $ Zip    : int  60640 60640 60613 60614 60654 60654 60651 60607 60607 60616 ...

We need to clean the data a bit. We’ll add a new column for a full address, as required by the geocoding service. When you use a geocoding service, be sure to read the documentation and understand how the data needs to be formatted for input.

methadoneClinics$fullAdd <- paste(as.character(methadoneClinics$Address), 
                                  as.character(methadoneClinics$City),
                                  as.character(methadoneClinics$State), 
                                  as.character(methadoneClinics$Zip))

We’re ready to go! Batch geocode with one function, and inspect:

geoCodedClinics <-  geocode(methadoneClinics,
address = 'fullAdd', lat = latitude, long = longitude, method = 'cascade')
## Passing 27 addresses to the US Census batch geocoder
## Query completed in: 0.4 seconds
head(geoCodedClinics)
## # A tibble: 6 × 10
##       X Name                                                         Address                 City    State   Zip fullAdd                                  latitude longitude geo_method
##   <int> <chr>                                                        <chr>                   <chr>   <chr> <int> <chr>                                       <dbl>     <dbl> <chr>     
## 1     1 Chicago Treatment and Counseling Center, Inc.                4453 North Broadway st. Chicago IL    60640 4453 North Broadway st. Chicago IL 60640     42.0     -87.7 census    
## 2     2 Sundace Methadone Treatment Center, LLC                      4545 North Broadway St. Chicago IL    60640 4545 North Broadway St. Chicago IL 60640     42.0     -87.7 census    
## 3     3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview 3934 N. Lincoln Ave.    Chicago IL    60613 3934 N. Lincoln Ave. Chicago IL 60613        42.0     -87.7 census    
## 4     4 PDSSC - Chicago, Inc.                                        2260 N. Elston Ave.     Chicago IL    60614 2260 N. Elston Ave. Chicago IL 60614         41.9     -87.7 census    
## 5     5 Center for Addictive Problems, Inc.                          609 N. Wells St.        Chicago IL    60654 609 N. Wells St. Chicago IL 60654            41.9     -87.6 census    
## 6     6 Family Guidance Centers, Inc.                                310 W. Chicago Ave.     Chicago IL    60654 310 W. Chicago Ave. Chicago IL 60654         41.9     -87.6 census

There were two addresses that didn’t geocode correctly, which you can inspect further. This could involve a quick check for spelling issues; searching the address and pulling the lat/long manually from Google Maps; or, if we suspect a human or unknown error, omitting them. For this exercise we’ll simply omit the two clinics that didn’t geocode correctly.

geoCodedClinics2 <- na.omit(geoCodedClinics)

A.5 Convert to Spatial Data

This is not spatial data yet! To convert a plain data frame to spatial data, we use the powerful st_as_sf function from sf. Indicate the x, y coordinate columns (longitude, latitude) and the coordinate reference system used. Our geocoding service used the standard EPSG:4326, so we input that here.

methadoneSf <- st_as_sf(geoCodedClinics2, 
                        coords = c( "longitude", "latitude"),
                        crs = 4326)

Basic Map of Points

For a really simple map of points – to ensure they were geocoded and converted to spatial data correctly – we use tmap. We’ll use the interactive version to view.

tmap_mode("view")

tm_shape(methadoneSf) + tm_dots() 

If your points didn’t plot correctly:

  • Did you flip the longitude/latitude values?
  • Did you input the correct CRS?

Those two issues are the most common errors.
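One quick way to catch a flipped longitude/latitude before converting is to check the numeric ranges against what you expect for your study area. A base-R sketch (the helper function and the range limits for Chicago are illustrative assumptions, not part of the original tutorial):

```r
# Hypothetical helper: plausible coordinate ranges for the Chicago area.
looks_like_chicago <- function(lon, lat) {
  all(lon >= -89, lon <= -87, lat >= 41, lat <= 43)
}

looks_like_chicago(-87.63, 41.88)  # correct order: TRUE
looks_like_chicago(41.88, -87.63)  # flipped lon/lat: FALSE
```

A check like this, run on a few rows before calling st_as_sf, makes a swapped coords = c(...) argument obvious immediately.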

A.6 Merge Datasets

Reshape Data

Here, we use the COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv dataset to practice converting long data to a wide format.

We subset to the first two columns and the sixth column. That gives us the zip code, the reporting week, and cumulative cases of COVID-19. We want each zip code to be a unique row, with cases by week as columns. Choose whatever subset function you prefer!

covid = read.csv("SDOHPlace-DataWrangling/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")

covid_clean = covid[,c(1:2, 6)]

head(covid_clean) 
##   ZIP.Code Week.Number Cases...Cumulative
## 1    60603          39                 13
## 2    60604          39                 31
## 3    60611          16                 72
## 4    60611          15                 64
## 5    60615          11                 NA
## 6    60603          10                 NA

Now, we create a wide dataset with the cumulative cases for each week for each zip code. Run the code and you will see the new wide data format.

covid_wide <- reshape(covid_clean, direction = "wide",
                      idvar = "ZIP.Code", timevar = "Week.Number") 
head(covid_wide)
##    ZIP.Code Cases...Cumulative.39 Cases...Cumulative.16 Cases...Cumulative.15 Cases...Cumulative.11 Cases...Cumulative.10 Cases...Cumulative.12 Cases...Cumulative.13 Cases...Cumulative.14
## 1     60603                    13                    NA                    NA                    NA                    NA                    NA                    NA                    NA
## 2     60604                    31                    NA                    NA                    NA                    NA                    NA                    NA                    NA
## 3     60611                   458                    72                    64                    NA                    NA                    16                    41                    57
## 5     60615                   644                   171                   132                    NA                    NA                    26                    57                    99
## 31    60605                   391                    93                    65                    NA                    NA                    23                    39                    52
## 32  Unknown                   240                    23                    18                    NA                    NA                    NA                    NA                     9
##    Cases...Cumulative.34 Cases...Cumulative.17 Cases...Cumulative.18 Cases...Cumulative.19 Cases...Cumulative.20 Cases...Cumulative.31 Cases...Cumulative.22 Cases...Cumulative.23
## 1                     11                    NA                    NA                    NA                     5                     9                     6                     6
## 2                     29                     6                    11                    14                    17                    25                    22                    23
## 3                    352                    80                    92                    99                   114                   286                   139                   148
## 5                    567                   215                   243                   274                   302                   526                   353                   364
## 31                   325                   118                   135                   149                   155                   291                   187                   194
## 32                   162                    33                    53                    62                    69                   127                    97                   102
##    Cases...Cumulative.24 Cases...Cumulative.25 Cases...Cumulative.28 Cases...Cumulative.29 Cases...Cumulative.30 Cases...Cumulative.32 Cases...Cumulative.33 Cases...Cumulative.26
## 1                      6                     6                     6                     8                     9                    10                    11                     6
## 2                     24                    24                    25                    25                    25                    25                    25                    25
## 3                    152                   163                   223                   240                   264                   305                   333                   175
## 5                    376                   388                   444                   482                   506                   539                   556                   401
## 31                   198                   215                   247                   263                   277                   304                   312                   229
## 32                   106                   112                   120                   124                   126                   132                   142                   113
##    Cases...Cumulative.27 Cases...Cumulative.36 Cases...Cumulative.38 Cases...Cumulative.21 Cases...Cumulative.35 Cases...Cumulative.37 Cases...Cumulative.40
## 1                      6                    11                    13                     6                    11                    13                    14
## 2                     25                    31                    31                    20                    30                    31                    31
## 3                    196                   391                   435                   124                   371                   411                   478
## 5                    418                   606                   629                   332                   588                   613                   651
## 31                   235                   354                   379                   169                   333                   364                   399
## 32                   116                   186                   216                    92                   178                   201                   288

Join by Attribute

Here, we’ll merge datasets with a common variable in R. Merging the cumulative case dataset you created in the last section to the zip code spatial data (ChiZipMaster1.geojson) will allow you to map the case data. You’ll merge the case data and spatial data using the zip code field of each dataset.

We’ve cleaned our COVID case data already, but not all values in the zip code column are valid. There is a row with a value of “Unknown”, so let’s remove that.

covid_wide_clean <- covid_wide %>%
  filter(ZIP.Code != "Unknown" & !is.na(ZIP.Code))

Then, we need to load the zipcode data.

zipcode <- st_read("SDOHPlace-DataWrangling/ChiZipMaster1.geojson")

Your output will look something like:

## Reading layer `ChiZipMaster1' from data source `./SDOHPlace-DataWrangling/ChiZipMaster1.geojson' using driver `GeoJSON'
## Simple feature collection with 540 features and 31 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.87596 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

You’ll notice that the zip codes are repeated in the zip code dataset, so it needs to be cleaned before we can continue with merging the data.

zipcode_unique <- zipcode %>%
  group_by(zip) %>%
  slice(1) %>%   # keep the first feature for each zip code
  ungroup()
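As a quick check (a sketch), verify that each zip code now appears only once:

```r
# Should return FALSE once duplicate zip codes have been removed
any(duplicated(zipcode_unique$zip))
```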

Now, the two datasets are ready to join together by zip code. Make sure to check they have been joined successfully.

We’ll call the joined dataset Chi_Zipsf, to denote a final zip code master dataset.

Chi_Zipsf <- zipcode_unique %>%
  left_join(covid_wide_clean, by = c("zip" = "ZIP.Code"))
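One way to check the join (a sketch; the case-rate column name follows the output shown later in this section):

```r
# A left join should keep one row per unique zip code
nrow(zipcode_unique) == nrow(Chi_Zipsf)

# Count zips that found no match in the case data (NA after the join)
sum(is.na(Chi_Zipsf$Case.Rate...Cumulative))
```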

We’ll reproject to EPSG:3435, the standard used in our study area.

Chi_Zipsf.3435 <- st_transform(Chi_Zipsf, 3435)
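You can confirm the reprojection took effect by checking the CRS of the new object:

```r
# Should report EPSG 3435 (NAD83 / Illinois East, ftUS)
st_crs(Chi_Zipsf.3435)$epsg
```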

Join by Location

We’ll create a spatial join with the housing and zip code data we’ve brought in.

In this example, we want to join zip-level data to the Rental Housing Developments, so we can identify which zips they are within.

First, let’s try “sticking” all the zip code data to the housing points, intersecting zip codes with housing developments.

To do this, both datasets will need to be in the same CRS. We have already standardized both using EPSG:3435.

Housing.f <- st_join(housing.3435, Chi_Zipsf.3435, join = st_intersects)

Don’t forget to inspect the data. Uncomment to explore!

#head(Housing.f)

We could also flip things around and count how many developments intersect each zip code. We can use lengths() to count the items in each element of the list returned by st_intersects():

Chi_Zipsf.3435$TotHousing <- lengths(st_intersects(Chi_Zipsf.3435, housing.3435))

head(Chi_Zipsf.3435)
## Simple feature collection with 6 features and 63 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1173037 ymin: 1889918 xmax: 1183259 ymax: 1902960
## Projected CRS: NAD83 / Illinois East (ftUS)
## # A tibble: 6 × 64
##   zip   objectid shape_area shape_len Case.Rate...Cumulative  year totPopE whiteP blackP amIndP asianP pacIsP otherP hispP noHSP age0_4 age5_14 age15_19 age20_24 age15_44 age45_49 age50_54
##   <chr>    <dbl>      <dbl>     <dbl>                  <dbl> <int>   <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <int>   <int>    <int>    <int>    <int>    <int>    <int>
## 1 60601       27   9166246.    19805.                  1451.  2018   14675   74.2   5.57   0.45   18     0      1.81  8.68  0       550     156      907      909     8726      976     1009
## 2 60602       26   4847125.    14448.                  1688.  2018    1244   68.2   3.78   5.31   19.4   0      3.3   6.51  0        61      87       18       91      987       46       53
## 3 60603       19   4560229.    13673.                  1107.  2018    1174   63.5   3.24   0      27.6   0      5.71  9.8   0        13      43      179      172      684       75       47
## 4 60604       48   4294902.    12246.                  3964.  2018     782   63.4   5.63   0      29.7   0      1.28  4.35  0        12       7       52      168      450       27       47
## 5 60605       20  36301276.    37973.                  1421.  2018   27519   61.2  17.2    0.18   16.1   0.03   5.31  5.84  2.39    837    1279     2172     2282    16364     1766     1520
## 6 60606       31   6766411.    12040.                  2290.  2018    3101   72.8   2.35   0      18.1   0      6.8   6.29  0.73     57      44        0      139     1863      213      153
## # ℹ 42 more variables: age55_59 <int>, age60_64 <int>, ageOv65 <int>, ageOv18 <int>, age18_64 <int>, a15_24P <dbl>, und45P <dbl>, ovr65P <dbl>, disbP <dbl>,
## #   geometry <MULTIPOLYGON [US_survey_foot]>, Cases...Cumulative.39 <int>, Cases...Cumulative.16 <int>, Cases...Cumulative.15 <int>, Cases...Cumulative.11 <int>, Cases...Cumulative.10 <int>,
## #   Cases...Cumulative.12 <int>, Cases...Cumulative.13 <int>, Cases...Cumulative.14 <int>, Cases...Cumulative.34 <int>, Cases...Cumulative.17 <int>, Cases...Cumulative.18 <int>,
## #   Cases...Cumulative.19 <int>, Cases...Cumulative.20 <int>, Cases...Cumulative.31 <int>, Cases...Cumulative.22 <int>, Cases...Cumulative.23 <int>, Cases...Cumulative.24 <int>,
## #   Cases...Cumulative.25 <int>, Cases...Cumulative.28 <int>, Cases...Cumulative.29 <int>, Cases...Cumulative.30 <int>, Cases...Cumulative.32 <int>, Cases...Cumulative.33 <int>,
## #   Cases...Cumulative.26 <int>, Cases...Cumulative.27 <int>, Cases...Cumulative.36 <int>, Cases...Cumulative.38 <int>, Cases...Cumulative.21 <int>, Cases...Cumulative.35 <int>,
## #   Cases...Cumulative.37 <int>, Cases...Cumulative.40 <int>, TotHousing <int>

A.7 Inspect Data

A.7.1 Thematic Maps

To inspect data from a spatial perspective, we can create a series of choropleth maps.

Example 1. Number of Affordable Housing Developments per Zip Code

Choose the variable “TotHousing” to map total developments per zip code, as we calculated previously. Here we’ll map using Jenks data breaks, with a Blue to Purple palette, and four bins (or groups of data to be plotted). A histogram of the data is plotted, visualizing where breaks occurred in the data to generate the map.

tmap_mode("plot")

tm_shape(Chi_Zipsf.3435) + 
  tm_polygons("TotHousing", legend.hist = TRUE, style = "jenks",
              pal = "BuPu", n = 4, title = "Housing Dev.") + 
  tm_layout(legend.outside = TRUE, legend.outside.position = "right")

Example 2. Number of COVID-19 Cases per Zip Code

Let’s do the same, but plot a different variable. Pass a different variable name as the first argument to tm_polygons.

tm_shape(Chi_Zipsf.3435) + tm_polygons("Case.Rate...Cumulative",
              legend.hist = TRUE, style="jenks", 
              pal="BuPu", n=4, title = "COVID Case Rate") + 

  tm_layout(legend.outside = TRUE, legend.outside.position = "right")
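If you’d like to keep a static copy of a map, tmap provides tmap_save(). This sketch (the filename and dimensions are illustrative) assigns the COVID map to an object and writes it to a PNG:

```r
covid_map <- tm_shape(Chi_Zipsf.3435) + 
  tm_polygons("Case.Rate...Cumulative", legend.hist = TRUE, style = "jenks",
              pal = "BuPu", n = 4, title = "COVID Case Rate") + 
  tm_layout(legend.outside = TRUE, legend.outside.position = "right")

# Write the map to disk (filename and size are illustrative)
tmap_save(covid_map, "covid_case_rate.png", width = 8, height = 6, units = "in")
```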

A.7.2 Map Overlay

Example 1. Affordable Housing Developments & Zipcode Boundaries

We previously translated the housing dataset from a CSV to a spatial object. Let’s take an attribute connected with each point, the total number of units per housing development, and visualize it with graduated symbols. Points with more units will be drawn larger, so not all places are weighted the same visually.

We use the “style” parameter to add a standard deviation data classification break. Check out the tmap documentation for more options, like quantiles, natural breaks (Jenks), and others.

tm_shape(housing.3435) + tm_bubbles("Units", col = "purple", style = "sd") 

Then, let’s overlay that layer to the zipcode boundary.

tm_shape(Chi_Zipsf.3435) + tm_polygons(col = "gray80") +
  
  tm_shape(housing.3435) + tm_bubbles("Units", col = "purple") 

You can also color-code according to the total number of units. Here, we’ll add a “viridis” color scheme, creating a graduated color point map. For extra style, we’ll add labels to each zip code, update with a new basemap, and make the whole map interactive.

tmap_mode("view")
## tmap mode set to interactive viewing
#Add Basemap
tm_basemap("Esri.WorldGrayCanvas") + 
  
  #Add First Layer, Style
  tm_shape(Chi_Zipsf.3435) +  tm_borders(col = "gray10") + 
    tm_text("zip", size = 0.7) +

  #Add Second Layer, Style
  tm_shape(housing.3435) + 
    tm_bubbles( col = "Units", style = "quantile", 
                pal = "viridis", size = 0.1) 

Example 2. COVID-19 & Methadone

For this final example, let’s create a map showing both COVID-19 and methadone clinic data (used in A.3). First, let’s add our zipcode map.

With this overlay, we’ll add a “hack” to include the methadone clinic points in a legend.

tmap_mode("plot")
## tmap mode set to plotting
##Add and style First Layer
tm_shape(Chi_Zipsf) + tm_polygons("Case.Rate...Cumulative", 
              style="jenks", pal="BuPu", n=4, title = "COVID Rt") + 
  
  ##Add Second Layer
  tm_shape(methadoneSf) + tm_dots(size = 0.2, col = "gray20") +
  
  ##"Hack" a manual symbology for dots in the legend
  tm_add_legend("symbol", col = "gray20", size = .2, labels = "Methadone MOUD") +
  
  ##Final Cartographic Styling
  tm_layout(legend.outside = TRUE, legend.outside.position = "right")

Resources

  • For tips on using tmap, check out the online text, Elegant and informative maps with tmap by Tennekes and Nowosad.

  • Try out more mapping with the ggplot2 library. The Maps chapter will give you a head start.

  • We highly recommend Chapters 3-5 as mandatory reading in this classic, Geocomputation with R by Lovelace, Nowosad, and Muenchow. Perfecting selections and filters in the Attribute Data Operations chapter will help you become a data wrangling master. Perfect distance metrics and essential GIS operations in subsequent chapters.

  • The Appendix in Gimond’s Intro to GIS online book has a super overview of R examples, not to be missed.

  • Another superb resource is Analyzing US Census Data by Kyle Walker, with some of our favorite examples of extracting & reshaping data directly from the Census using code. Highly recommended!