
A Data Wrangling in R
In this part of our toolkit, we’re going to learn how to do the same things we did with Chapter 4 - Spatial Data Wrangling, but this time, we’ll use R code to handle our spatial data.
Getting Started
R is a great choice for starting in data science because it’s built for it. It’s not just a programming language, it is a whole system with tools and libraries made to help you think and work like a data scientist easily.
We assume a basic knowledge of R and coding languages for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, know how to find the address to a folder on your computer system, and have very basic familiarity with R.
Tutorials for R
If you are new to R, we recommend the following intro-level tutorials provided through installation guides. You can also refer to this R for Social Scientists tutorial developed by Data Carpentry for a refresher.
You can also visit the RStudio Education page to select a learning path tailored to your experience level (Beginners, Intermediates, Experts). They offer detailed instructions to learners at different stages of their R journey.
A.1 Environment Setup
Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.
Download the Activity Datasets
Please download and unzip this file to get started: SDOHPlace-DataWrangling.zip
Setting Up the Working Directory
Setting up a working directory in R is crucial–any file paths you use in your code, like the names of files to read data from, will be treated at relative to your current working directly. You should set the working directory to any folder on your system where you plan to store your datasets and R scripts.
To set your working directory, use
To see what your working directory is currently set to, use getwd()
.
A.1.0.1 Example: Local directory on Windows
You can use forward slashes for directory paths:
or you can use double backslashes:
A.1.0.2 Example: Local directory on macOS or Linux
Use forward slashes:
You can also use ~
as an abbreviation for your own home directory:
A.1.0.3 Example: Posit Cloud
Posit Cloud is a web environment that allows you to create R projects online. In a Posit project, your working directory will already be set to your project. You can confirm this like so:
Installing & Working with R Libraries
Before starting operations related to spatial data, we need to complete an environmental setup. This workshop requires several packages, which can be installed from CRAN:
sf
: simplifies spatial data manipulationtmap
: streamlines thematic map creationdplyr
: facilitates data manipulationtidygeocoder
: converts addresses to coordinates easily
You only need to install packages once in an R environment. Use these commands to install the packages needed for this tutorial.
install.packages("sf")
install.packages("tmap")
install.packages("tidygeocoder")
install.packages("dplyr")
Installation Tip
For Mac users, check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the sf
package. Using homebrew to install gdal
usually fixes any remaining issues.
Now, loading the required libraries for further steps:
A.2 Intro to Spatial Data
Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.
Load Spatial Data
We need to load the spatial data (shapefile). Remember, this type of data is actually comprised of multiple files. All need to be present in order to read correctly. Let’s use chicagotracts.shp for practice, which includes the census tracts boundary in Chicago.
First, we need to read the shapefile data from where you save it.
Your output will look something like:
## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS: WGS 84
Always inspect data when loading in. Let’s look at a non-spatial view.
## Simple feature collection with 6 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.68822 ymin: 41.72902 xmax: -87.62394 ymax: 41.87455
## Geodetic CRS: WGS 84
## commarea commarea_n countyfp10 geoid10 name10 namelsad10 notes statefp10 tractce10
## 1 44 44 031 17031842400 8424 Census Tract 8424 <NA> 17 842400
## 2 59 59 031 17031840300 8403 Census Tract 8403 <NA> 17 840300
## 3 34 34 031 17031841100 8411 Census Tract 8411 <NA> 17 841100
## 4 31 31 031 17031841200 8412 Census Tract 8412 <NA> 17 841200
## 5 32 32 031 17031839000 8390 Census Tract 8390 <NA> 17 839000
## 6 28 28 031 17031838200 8382 Census Tract 8382 <NA> 17 838200
## geometry
## 1 POLYGON ((-87.62405 41.7302...
## 2 POLYGON ((-87.68608 41.8229...
## 3 POLYGON ((-87.62935 41.8528...
## 4 POLYGON ((-87.68813 41.8556...
## 5 POLYGON ((-87.63312 41.8744...
## 6 POLYGON ((-87.66782 41.8741...
Check out the data structure of this file.
## Classes 'sf' and 'data.frame': 801 obs. of 10 variables:
## $ commarea : chr "44" "59" "34" "31" ...
## $ commarea_n: num 44 59 34 31 32 28 65 53 76 77 ...
## $ countyfp10: chr "031" "031" "031" "031" ...
## $ geoid10 : chr "17031842400" "17031840300" "17031841100" "17031841200" ...
## $ name10 : chr "8424" "8403" "8411" "8412" ...
## $ namelsad10: chr "Census Tract 8424" "Census Tract 8403" "Census Tract 8411" "Census Tract 8412" ...
## $ notes : chr NA NA NA NA ...
## $ statefp10 : chr "17" "17" "17" "17" ...
## $ tractce10 : chr "842400" "840300" "841100" "841200" ...
## $ geometry :sfc_POLYGON of length 801; first list element: List of 1
## ..$ : num [1:243, 1:2] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr [1:9] "commarea" "commarea_n" "countyfp10" "geoid10" ...
The data is no longer a shapefile but an sf
object, comprised of polygons. The plot()
command in R help to quickly visualizes the geometric shapes of Chicago’s census tracts. The output includes multiple maps because the sf
framework enables previews of each attribute in our spatial file.
A.2.1 Adding a Basemap
Then, we can use tmap
, a mapping library, in interactive mode to add a basemap layer. It plots the spatial data from Chi_tracts
, applies a minimal theme for clarity, and labels the map with a title, offering a straightforward visualization of the area’s census tracts.
We stylize the borders of the tract boundaries by making it transparent at 50% (which is equal to an alpha level of 0.5).
## ℹ tmap mode set to "view".
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(title = )`
Still in the interactive mode (`view’), we can switch to a different basemap. Here we bring in a “Voyager” style map from Carto, a cartographic firm. We’ll make the borders less transparent by adjusting the alpha level.
## ℹ tmap mode set to "view".
tm_basemap("CartoDB.Voyager") +
tm_shape(Chi_tracts) + tm_borders(alpha=0.8, col = "gray40") +
tm_layout(title = "Census Tract Map of Chicago")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(title = )`
Tip
For additional options, you can preview basemaps at the Leaflet Providers Demo. Some basemaps we recommend that work consistently are by:
- CartoDB (Carto)
- Open Street Map
- ESRI
Not all basemaps are available anymore, and some require keys that you’d need to add on your own.
A.3 Coordinate Reference Systems
For this exercise we will use chicagotracts.shp
to explore how to change the projection of a spatial dataset in R. First, let’s check out the current coordinate reference system.
## Coordinate Reference System:
## User input: WGS 84
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]
We can use the st_transform
function to transform CRS. When projecting a dataset of Illinois, the most appropriate NAD83 projection would be NAD83 UTM zone 16N. Chicago sits within the area best covered by NAD83 / Illinois East (ftUS) (EPSG:3435).
Chi_tracts.3435 <- st_transform(Chi_tracts, "EPSG:3435")
# Chi_tracts.3435 <- st_transform(Chi_tracts, 3435)
st_crs(Chi_tracts.3435)
## Coordinate Reference System:
## User input: EPSG:3435
## wkt:
## PROJCRS["NAD83 / Illinois East (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["SPCS83 Illinois East zone (US survey foot)",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",36.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-88.3333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.999975,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",984250,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - Illinois - counties of Boone; Champaign; Clark; Clay; Coles; Cook; Crawford; Cumberland; De Kalb; De Witt; Douglas; Du Page; Edgar; Edwards; Effingham; Fayette; Ford; Franklin; Gallatin; Grundy; Hamilton; Hardin; Iroquois; Jasper; Jefferson; Johnson; Kane; Kankakee; Kendall; La Salle; Lake; Lawrence; Livingston; Macon; Marion; Massac; McHenry; McLean; Moultrie; Piatt; Pope; Richland; Saline; Shelby; Vermilion; Wabash; Wayne; White; Will; Williamson."],
## BBOX[37.06,-89.27,42.5,-87.02]],
## ID["EPSG",3435]]
After change the projection, we can plot the map. We’ll switch to the static version of tmap, using the plot
mode.
## ℹ tmap mode set to "plot".
tm_shape(Chi_tracts.3435) + tm_borders(alpha=0.5) +
tm_layout(main.title ="EPSG:3435 (ft)",
main.title.position = "center")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
What if we had used the wrong EPSG code, referencing the wrong projection? Here we’ll transform and plot EPSG Code 3561, a coordinate reference system in Hawai’i.
Chi_tracts.3561 <- st_transform(Chi_tracts, "EPSG:3561")
tm_shape(Chi_tracts.3561) + tm_borders(alpha=0.5) +
tm_layout(main.title ="EPSG:3561 (ft)",
main.title.position = "center")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
It’s obviously not correct – the wrong CRS can cause a load of trouble. Make sure you specify carefully!
Refine Basic Map
Let’s take a deeper look at the cartographic mapping package, tmap
. We approach mapping with one layer at a time. Always start with the object you want to map by calling it with the tm_shape
function. Then, at least one descriptive/styling function follows. There are hundreds of variations and paramater specification.
Here we style the tracts with some semi-transparent borders.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.
Next we fill the tracts with a light gray, and adjust the color and transparency of borders. We also add a scale bar, positioning it to the left and having a thickness of 0.8 units, and turn off the frame.
tm_shape(Chi_tracts) + tm_fill(col = "gray90") + tm_borders(alpha=0.2, col = "gray10") +
tm_scale_bar(position = ("bottom"), lwd = 0.8) +
tm_layout(frame = F)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.
## ! `tm_scale_bar()` is deprecated. Please use `tm_scalebar()` instead.
Check out https://rdrr.io/cran/tmap/man/tm_polygons.html for more ideas.
A.4 Converting to Spatial Data
A.4.1 Convert CSVs to Spatial Data
We are using the Affordable_Rental_Housing_Developments.csv
in the dataset to show how to convert a csv Lat/Long data to points. First, we need to load the CSV data.
Then, we need to ensure that no column (intended to be used as a coordinate) is entirely empty or filled with NA values.
Inspect the data to confirm it’s doing what you expect it to be doing. What columns will you use to specify the coordinates? In this dataset, we have multiple coordinate options. We’ll use latitude and longitude, or rather, longitude as our X value, and latitude as our Y value. In the data, it’s specified as “Longitude” and “Latitude.”
## Community.Area.Name Community.Area.Number Property.Type Property.Name Address
## 2 Rogers Park 1 Senior Morse Senior Apts. 6928 N. Wayne Ave.
## 3 Uptown 3 ARO The Draper 5050 N. Broadway
## 4 Edgewater 77 Senior Pomeroy Apts. 5650 N. Kenmore Ave.
## 5 Roseland 49 Supportive Housing Wentworth Commons 11045 S. Wentworth Ave.
## 6 Humboldt Park 23 Multifamily Nelson Mandela Apts. 607 N. Sawyer Ave.
## 7 Grand Boulevard 38 Multifamily Legends South - Gwendolyn Place 4333 S. Michigan Ave.
## Zip.Code Phone.Number Management.Company Units X.Coordinate Y.Coordinate Latitude Longitude
## 2 60626 312-602-6207 Morse Urban Dev. 44 1165844 1946059 42.00757 -87.66517
## 3 60640 312-818-1722 Flats LLC 35 1167357 1933882 41.97413 -87.65996
## 4 60660 773-275-7820 Habitat Company 198 1168181 1937918 41.98519 -87.65681
## 5 60628 773-568-7804 Mercy Housing Lakefront 50 1176951 1831516 41.69302 -87.62777
## 6 60624 773-227-6332 Bickerdike Apts. 6 1154640 1903912 41.89215 -87.70753
## 7 60653 773-624-7676 Interstate Realty Management Co. 71 1177924 1876178 41.81555 -87.62286
## Location
## 2 (42.0075737709331, -87.6651711448293)
## 3 (41.9741295261027, -87.6599553011627)
## 4 (41.9851867755403, -87.656808676983)
## 5 (41.6930159120977, -87.6277673462214)
## 6 (41.8921534052465, -87.7075265659001)
## 7 (41.815550396096, -87.6228565224104)
Finally, we start to convert it to points. Be sure you use the CRS of the original coordinates recorded. In this case we weren’t sure what CRS that was, so we use EPSG:4326
to test.
View the resulting sf object with a basemap to confirm they are in the right place. Overlay them on top of the tract data, to confirm they are plotting correctly.
### First Layer
tm_shape(Chi_tracts) + tm_borders(lwd = 0.5) +
### Second Layer
tm_shape(points_housing) + tm_dots(size = 0.1 )
You can change the tmap_mode
to “view” to add a basemap in an interactive setting, and then switch back to “plot” when complete. Because we’e plotting dots using tmap, we’ll use the tm_dots
parameter for styling.
## ℹ tmap mode set to "view".
## ℹ tmap mode set to "plot".
We’ll reproject to EPSG:3435
, our system standard
A.4.1.1 Write Spatial Data
Finally, we can save our points as a spatial dataset. Use ‘st_write’ to write your spatial object in R to a data format of your choice. Here, we’ll write to a geojson file.
You could also save as a “housing.shapefile” to get a shapefile format, however you’ll get an error noting that some column names are too long and must be shortened. Shapefiles have a limit of 10 characters for field names.
The file may still write, but the column names that were too long may be shortened automatically.
To change column or field names in R objects, there are dozens of options. Try searching and “Googling” different search terms to identify solutions on your own.
A.4.2 Geocode Addresses
Here, we will use chicago_methadone_nogeometry.csv
for practice, which includes methadone centers in Chicago (center names and addresses). First we load the tidygeocoder
library to get our geocoding done.
Let’s read in and inspect data for methadone maintenance providers. Note, these addresses were made available by SAMHSA, and are known as publicly available information. An additional analysis could call each service to check on access to medication during COVID in September 2020, and the list would be updated further.
methadoneClinics <- read.csv("SDOHPlace-DataWrangling/chicago_methadone_nogeometry.csv")
head(methadoneClinics)
## X Name Address City State Zip
## 1 1 Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago IL 60640
## 2 2 Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago IL 60640
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview 3934 N. Lincoln Ave. Chicago IL 60613
## 4 4 PDSSC - Chicago, Inc. 2260 N. Elston Ave. Chicago IL 60614
## 5 5 Center for Addictive Problems, Inc. 609 N. Wells St. Chicago IL 60654
## 6 6 Family Guidance Centers, Inc. 310 W. Chicago Ave. Chicago IL 60654
Let’s geocode one address first, just to make sure our system is working. We’ll use the “osm” method to use the OpenStreetMap geocoder.
## Passing 1 address to the Nominatim single address geocoder
## Query completed in: 1.8 seconds
## # A tibble: 1 × 3
## address latitude longitude
## <chr> <dbl> <dbl>
## 1 2260 N. Elston Ave. Chicago, IL 41.9 -87.7
As we prepare for geocoding, check out the structure of the dataset. The data should be a character to be read properly.
## 'data.frame': 27 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Chicago Treatment and Counseling Center, Inc." "Sundace Methadone Treatment Center, LLC" "Soft Landing Interventions/DBA Symetria Recovery of Lakeview" "PDSSC - Chicago, Inc." ...
## $ Address: chr "4453 North Broadway st." "4545 North Broadway St." "3934 N. Lincoln Ave." "2260 N. Elston Ave." ...
## $ City : chr "Chicago" "Chicago" "Chicago" "Chicago" ...
## $ State : chr "IL" "IL" "IL" "IL" ...
## $ Zip : int 60640 60640 60613 60614 60654 60654 60651 60607 60607 60616 ...
We need to clean the data a bit. We’ll add a new column for a full address, as required by the geocoding service. When you use a geocoding service, be sure to read the documentation and understand how the data needs to be formatted for input.
methadoneClinics$fullAdd <- paste(as.character(methadoneClinics$Address),
as.character(methadoneClinics$City),
as.character(methadoneClinics$State),
as.character(methadoneClinics$Zip))
We’re ready to go! Batch geocode with one function, and inspect:
geoCodedClinics <- geocode(methadoneClinics,
address = 'fullAdd', lat = latitude, long = longitude, method="osm")
## Passing 27 addresses to the Nominatim single address geocoder
## Query completed in: 33.4 seconds
## # A tibble: 6 × 9
## X Name Address City State Zip fullAdd latitude longitude
## <int> <chr> <chr> <chr> <chr> <int> <chr> <dbl> <dbl>
## 1 1 Chicago Treatment and Counseling Center, Inc. 4453 Nort… Chic… IL 60640 4453 N… 42.0 -87.7
## 2 2 Sundace Methadone Treatment Center, LLC 4545 Nort… Chic… IL 60640 4545 N… 42.0 -87.7
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview 3934 N. L… Chic… IL 60613 3934 N… 42.0 -87.7
## 4 4 PDSSC - Chicago, Inc. 2260 N. E… Chic… IL 60614 2260 N… 41.9 -87.7
## 5 5 Center for Addictive Problems, Inc. 609 N. We… Chic… IL 60654 609 N.… 41.9 -87.6
## 6 6 Family Guidance Centers, Inc. 310 W. Ch… Chic… IL 60654 310 W.… 41.9 -87.6
In some cases you may find that an address does not geocode correctly. You can inspect further. This could involve a quick check for spelling issues or searching the address and pulling the lat/long using Google Maps and inputting manually. Or, if we are concerned it’s a human or unknown error, we could omit. For this exercise we’ll just omit the two clinics that didn’t geocode correctly.
A.5 Convert to Spatial Data
This is not spatial data yet! To convert a static file to spatial data, we use the powerful st_as_sf
function from sf
. Indicate the x,y parameters (=longitude, latitude) and the coordinate reference system used. Our geocoding service used the standard EPSG:4326, so we input that here.
Basic Map of Points
For a really simple map of points – to ensure they were geocoded and converted to spatial data correctly, we use tmap
. We’ll use the interactive version to view.
If your points didn’t plot correctly:
- Did you flip the longitude/latitude values?
- Did you input the correct CRS?
Those two issues are the most common errors.
A.6 Merge Data sets
Reshape Data
Here, we are trying to use the COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv
dataset to practice how to convert long data to a wide data format.
We subset to the first two columns, and the sixth column. That gives us the zip code, the reporting week, and cumulative cases of Covid-19. Choose whatever subset function you prefer best!
covid = read.csv("SDOHPlace-DataWrangling/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")
covid_clean = covid[,c(1:2, 6)]
head(covid_clean)
## ZIP.Code Week.Number Cases...Cumulative
## 1 60603 39 13
## 2 60604 39 31
## 3 60611 16 72
## 4 60611 15 64
## 5 60615 11 NA
## 6 60603 10 NA
Now, we want each zip code to be a unique row, with cases by week as a columns because we are trying to create a wide data set with the cumulative cases for each week for each zip code. Enter the code and you will see the new wide data format.
covid_wide <- reshape(covid_clean, direction = "wide",
idvar = "ZIP.Code", timevar = "Week.Number")
head(covid_wide)
## ZIP.Code Cases...Cumulative.39 Cases...Cumulative.16 Cases...Cumulative.15 Cases...Cumulative.11 Cases...Cumulative.10
## 1 60603 13 NA NA NA NA
## 2 60604 31 NA NA NA NA
## 3 60611 458 72 64 NA NA
## 5 60615 644 171 132 NA NA
## 31 60605 391 93 65 NA NA
## 32 Unknown 240 23 18 NA NA
## Cases...Cumulative.12 Cases...Cumulative.13 Cases...Cumulative.14 Cases...Cumulative.34 Cases...Cumulative.17
## 1 NA NA NA 11 NA
## 2 NA NA NA 29 6
## 3 16 41 57 352 80
## 5 26 57 99 567 215
## 31 23 39 52 325 118
## 32 NA NA 9 162 33
## Cases...Cumulative.18 Cases...Cumulative.19 Cases...Cumulative.20 Cases...Cumulative.31 Cases...Cumulative.22
## 1 NA NA 5 9 6
## 2 11 14 17 25 22
## 3 92 99 114 286 139
## 5 243 274 302 526 353
## 31 135 149 155 291 187
## 32 53 62 69 127 97
## Cases...Cumulative.23 Cases...Cumulative.24 Cases...Cumulative.25 Cases...Cumulative.28 Cases...Cumulative.29
## 1 6 6 6 6 8
## 2 23 24 24 25 25
## 3 148 152 163 223 240
## 5 364 376 388 444 482
## 31 194 198 215 247 263
## 32 102 106 112 120 124
## Cases...Cumulative.30 Cases...Cumulative.32 Cases...Cumulative.33 Cases...Cumulative.26 Cases...Cumulative.27
## 1 9 10 11 6 6
## 2 25 25 25 25 25
## 3 264 305 333 175 196
## 5 506 539 556 401 418
## 31 277 304 312 229 235
## 32 126 132 142 113 116
## Cases...Cumulative.36 Cases...Cumulative.38 Cases...Cumulative.21 Cases...Cumulative.35 Cases...Cumulative.37
## 1 11 13 6 11 13
## 2 31 31 20 30 31
## 3 391 435 124 371 411
## 5 606 629 332 588 613
## 31 354 379 169 333 364
## 32 186 216 92 178 201
## Cases...Cumulative.40
## 1 14
## 2 31
## 3 478
## 5 651
## 31 399
## 32 288
Join by Attribute
Here, we’ll merge data sets with a common variable in R. Merging the cumulative case data set you created in the last section to zip code spatial data (ChiZipMaster1.geojson
) will allow you to map the case data. You’ll be merging the case data and spatial data using the zip codes field of each dataset.
We’ve cleaned our covid case data already, but not all values under the zip code column are valid. There is a row has a value of “unknown”, so let’s remove that.
Then, we need to load the zip code data.
Your output will look something like:
Reading layer `ChiZipMaster1' from data source `./SDOHPlace-DataWrangling/ChiZipMaster1.geojson' using driver `GeoJSON'
Simple feature collection with 59 features and 30 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
Now, the two datasets are ready to join together by the zip code. Make sure to check they have been joined successully.
We call the these joined datasets Chi_Zipsf
, to denote a final zip code master dataset.
We’ll reproject to EPSG:3435, the standard used in our study area.
Join by Location
We’ll create a spatial join with the housing and zip code data we’ve brought in.
In this example, we want to join zip-level data to the Rental Housing Developments, so we can identify which zips they are within.
First, let’s try “sticking” all the zip code data to the housing points, essentially intersecting zip codes with housing developments.
To do this, both datasets will need to be in the same CRS. We have already standardized both using EPSG:3435.
Don’t forget to inspect the data.
## Simple feature collection with 6 features and 73 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1154639 ymin: 1831516 xmax: 1177923 ymax: 1946059
## Projected CRS: NAD83 / Illinois East (ftUS)
## Community.Area.Name Community.Area.Number Property.Type Property.Name Address
## 2 Rogers Park 1 Senior Morse Senior Apts. 6928 N. Wayne Ave.
## 3 Uptown 3 ARO The Draper 5050 N. Broadway
## 4 Edgewater 77 Senior Pomeroy Apts. 5650 N. Kenmore Ave.
## 5 Roseland 49 Supportive Housing Wentworth Commons 11045 S. Wentworth Ave.
## 6 Humboldt Park 23 Multifamily Nelson Mandela Apts. 607 N. Sawyer Ave.
## 7 Grand Boulevard 38 Multifamily Legends South - Gwendolyn Place 4333 S. Michigan Ave.
## Zip.Code Phone.Number Management.Company Units X.Coordinate Y.Coordinate
## 2 60626 312-602-6207 Morse Urban Dev. 44 1165844 1946059
## 3 60640 312-818-1722 Flats LLC 35 1167357 1933882
## 4 60660 773-275-7820 Habitat Company 198 1168181 1937918
## 5 60628 773-568-7804 Mercy Housing Lakefront 50 1176951 1831516
## 6 60624 773-227-6332 Bickerdike Apts. 6 1154640 1903912
## 7 60653 773-624-7676 Interstate Realty Management Co. 71 1177924 1876178
## Location zip SHAPE_AREA SHAPE_LEN Case.Rate...Cumulative year totPopE whiteP blackP amIndP
## 2 (42.0075737709331, -87.6651711448293) 60626 49170579 33983.91 2911.7 2018 49730 59.20 25.27 0.10
## 3 (41.9741295261027, -87.6599553011627) 60640 77305246 48985.87 1843.2 2018 69715 64.27 16.96 0.50
## 4 (41.9851867755403, -87.656808676983) 60660 37984389 27307.81 1778.4 2018 43242 65.53 14.35 0.29
## 5 (41.6930159120977, -87.6277673462214) 60628 345241692 83748.99 2562.8 2018 66724 4.40 93.75 0.07
## 6 (41.8921534052465, -87.7075265659001) 60624 99418123 40394.41 3434.9 2018 36158 3.69 90.85 0.04
## 7 (41.815550396096, -87.6228565224104) 60653 67759828 34316.36 2430.3 2018 31972 5.22 91.00 0.29
## asianP pacIsP otherP hispP noHSP age0_4 age5_14 age15_19 age20_24 age15_44 age45_49 age50_54 age55_59 age60_64 ageOv65
## 2 5.89 0.01 9.53 20.62 11.82 2553 4765 2937 5056 25875 3295 3051 2713 2447 5031
## 3 11.03 0.03 7.21 14.26 10.31 3256 5183 1721 4535 35010 4966 4821 4142 3350 8987
## 4 13.03 0.02 6.77 17.87 10.54 1891 2448 2469 3289 22266 2821 2962 2513 2600 5741
## 5 0.18 0.02 1.58 3.44 14.84 4153 8578 4461 5226 25566 4574 4189 4764 3628 11272
## 6 0.31 0.08 5.03 5.30 26.14 3276 6034 2693 2935 15257 1913 2200 2032 1524 3922
## 7 1.34 0.00 2.15 2.16 14.92 2632 4990 1975 1811 12889 2200 1486 1883 1770 4122
## ageOv18 age18_64 a15_24P und45P ovr65P disbP Cases...Cumulative.39 Cases...Cumulative.16 Cases...Cumulative.15
## 2 41428 36397 16.07 66.75 10.12 10.8 1448 301 166
## 3 60170 51183 8.97 62.32 12.89 11.7 1285 319 175
## 4 38197 32456 13.32 61.53 13.28 12.3 769 182 115
## 5 51211 39939 14.52 57.40 16.89 15.9 1710 413 317
## 6 25134 21212 15.57 67.94 10.85 18.6 1242 334 242
## 7 23060 18938 11.84 64.15 12.89 17.8 777 238 176
## Cases...Cumulative.11 Cases...Cumulative.10 Cases...Cumulative.12 Cases...Cumulative.13 Cases...Cumulative.14
## 2 NA NA 17 40 83
## 3 NA NA 22 57 101
## 4 NA NA 16 35 63
## 5 NA NA 27 123 242
## 6 NA NA 19 76 166
## 7 NA NA 21 67 114
## Cases...Cumulative.34 Cases...Cumulative.17 Cases...Cumulative.18 Cases...Cumulative.19 Cases...Cumulative.20
## 2 1352 501 629 735 831
## 3 1130 467 567 633 696
## 4 637 224 274 314 342
## 5 1547 647 758 858 932
## 6 1124 417 510 598 671
## 7 716 286 324 376 425
## Cases...Cumulative.31 Cases...Cumulative.22 Cases...Cumulative.23 Cases...Cumulative.24 Cases...Cumulative.25
## 2 1279 1006 1074 1103 1118
## 3 1051 791 813 843 862
## 4 574 410 432 442 464
## 5 1408 1057 1077 1104 1135
## 6 1040 833 858 871 893
## 7 670 496 514 528 541
## Cases...Cumulative.28 Cases...Cumulative.29 Cases...Cumulative.30 Cases...Cumulative.32 Cases...Cumulative.33
## 2 1208 1238 1262 1296 1322
## 3 954 987 1016 1081 1103
## 4 516 534 557 587 618
## 5 1260 1309 1366 1462 1497
## 6 974 997 1020 1058 1090
## 7 601 622 648 689 702
## Cases...Cumulative.26 Cases...Cumulative.27 Cases...Cumulative.36 Cases...Cumulative.38 Cases...Cumulative.21
## 2 1142 1169 1384 1425 913
## 3 880 913 1200 1250 756
## 4 482 492 685 736 379
## 5 1166 1202 1629 1683 1003
## 6 915 942 1177 1217 778
## 7 562 578 743 762 474
## Cases...Cumulative.35 Cases...Cumulative.37 Cases...Cumulative.40 geometry
## 2 1362 1407 1468 POINT (1165843 1946059)
## 3 1167 1218 1309 POINT (1167356 1933883)
## 4 666 702 791 POINT (1168180 1937919)
## 5 1579 1656 1742 POINT (1176951 1831516)
## 6 1146 1190 1259 POINT (1154639 1903912)
## 7 727 753 789 POINT (1177923 1876178)
We could also flip things around, and try to count how many developments intersect each zip code. We can use lengths() to find out how many items are present in a vector.
Chi_Zipsf.3435$TotHousing <- lengths(st_intersects(Chi_Zipsf.3435, housing.3435))
head(Chi_Zipsf.3435)
## Simple feature collection with 6 features and 62 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1161881 ymin: 1818580 xmax: 1190986 ymax: 1902960
## Projected CRS: NAD83 / Illinois East (ftUS)
## zip SHAPE_AREA SHAPE_LEN Case.Rate...Cumulative year totPopE whiteP blackP amIndP asianP pacIsP otherP hispP noHSP
## 1 60607 64664294 39143.64 2504.1 2018 29591 58.17 14.51 0.13 22.58 0.10 4.51 8.30 3.94
## 2 60661 9357756 13132.57 1531.3 2018 9926 57.29 3.60 0.00 35.79 0.00 3.31 6.45 0.58
## 3 60606 6766411 12040.44 2289.6 2018 3101 72.75 2.35 0.00 18.09 0.00 6.80 6.29 0.73
## 4 60615 66565455 38321.31 1549.5 2018 41563 30.10 55.22 0.35 8.09 0.00 6.23 6.19 6.05
## 5 60643 207706233 75254.73 2422.3 2018 49870 22.24 73.53 0.14 0.36 0.02 3.70 3.31 8.48
## 6 60621 104746821 42299.92 2647.9 2018 29042 2.58 94.77 0.28 0.26 0.00 2.12 3.65 22.16
## age0_4 age5_14 age15_19 age20_24 age15_44 age45_49 age50_54 age55_59 age60_64 ageOv65 ageOv18 age18_64 a15_24P und45P
## 1 1660 1392 2351 4588 20860 1392 859 1122 702 1604 26241 24637 23.45 80.81
## 2 447 150 160 1032 7897 417 322 301 175 217 9243 9026 12.01 85.57
## 3 57 44 0 139 1863 213 153 168 172 431 3000 2569 4.48 63.33
## 4 2006 3512 2088 5317 21182 2293 2193 2095 2260 6022 34890 28868 17.82 64.24
## 5 3134 5603 3385 3277 17324 3225 4061 4044 3326 9153 38788 29635 13.36 52.26
## 6 2504 4102 1893 2260 11246 1317 2146 1927 1872 3928 21232 17304 14.30 61.47
## ovr65P disbP Cases...Cumulative.39 Cases...Cumulative.16 Cases...Cumulative.15 Cases...Cumulative.11
## 1 5.42 6.1 741 320 283 NA
## 2 2.19 2.3 152 20 17 NA
## 3 13.90 1.9 71 13 7 NA
## 4 14.49 12.0 644 171 132 NA
## 5 18.35 13.8 1208 344 284 NA
## 6 13.53 20.1 769 219 143 NA
## Cases...Cumulative.10 Cases...Cumulative.12 Cases...Cumulative.13 Cases...Cumulative.14 Cases...Cumulative.34
## 1 NA 17 46 79 623
## 2 NA NA 12 15 115
## 3 NA NA NA 5 58
## 4 NA 26 57 99 567
## 5 NA 22 93 195 1083
## 6 NA 14 37 91 705
## Cases...Cumulative.17 Cases...Cumulative.18 Cases...Cumulative.19 Cases...Cumulative.20 Cases...Cumulative.31
## 1 338 388 405 419 546
## 2 24 27 31 37 100
## 3 15 17 19 20 46
## 4 215 243 274 302 526
## 5 489 535 583 629 983
## 6 274 336 393 444 652
## Cases...Cumulative.22 Cases...Cumulative.23 Cases...Cumulative.24 Cases...Cumulative.25 Cases...Cumulative.28
## 1 440 445 453 464 503
## 2 41 42 43 46 66
## 3 22 27 27 27 36
## 4 353 364 376 388 444
## 5 696 714 735 749 848
## 6 517 536 548 564 601
## Cases...Cumulative.29 Cases...Cumulative.30 Cases...Cumulative.32 Cases...Cumulative.33 Cases...Cumulative.26
## 1 517 530 576 601 479
## 2 70 90 106 111 53
## 3 39 46 49 55 29
## 4 482 506 539 556 401
## 5 890 942 1011 1046 770
## 6 613 636 670 690 573
## Cases...Cumulative.27 Cases...Cumulative.36 Cases...Cumulative.38 Cases...Cumulative.21 Cases...Cumulative.35
## 1 487 654 696 436 634
## 2 58 125 146 39 120
## 3 33 63 67 20 61
## 4 418 606 629 332 588
## 5 797 1139 1189 665 1111
## 6 585 731 756 494 716
## Cases...Cumulative.37 Cases...Cumulative.40 geometry TotHousing
## 1 674 752 MULTIPOLYGON (((1171035 190... 11
## 2 136 160 MULTIPOLYGON (((1172812 190... 1
## 3 64 76 MULTIPOLYGON (((1173861 190... 0
## 4 613 651 MULTIPOLYGON (((1189361 187... 17
## 5 1156 1236 MULTIPOLYGON (((1167176 184... 2
## 6 743 780 MULTIPOLYGON (((1177515 185... 9
A.7 Inspect Data
A.7.1 Thematic Maps
To inspect data from a spatial perspective, we can create a series of choropleth maps.
Example 1. Number of Affordable Housing Developments per Zip Code
Choose the variable “TotHousing” to map total developments per zip code, as we calculated previously. Here we’ll map using Jenks data breaks, with a Blue to Purple palette, and four bins (or groups of data to be plotted). A histogram of the data is plotted, visualizing where breaks occured in the data to generate the map.
tmap_mode("plot")
tm_shape(Chi_Zipsf.3435) + tm_polygons("TotHousing", legend.hist = TRUE, style="jenks", pal="BuPu", n=4, title = "Housing Dev.") +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
Note: You may get an error about missing the package ggplot2
. To address this, just install that package and then rerun the command above.
Example 2. Number of COVID-19 Cases per Zip Code
Let’s do the same, but plut a different variable. Select a different variable name as your parameter in the ‘tm_fill’ parameter.
tm_shape(Chi_Zipsf.3435) + tm_polygons("Case.Rate...Cumulative",
legend.hist = TRUE, style="jenks",
pal="BuPu", n=4, title = "COVID Case Rate") +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
A.7.2 Map Overlay
Example 1. Affordable Housing Developments & Zip Code Boundaries
We previously translated the housing dataset from a CSV to a spatial object. Let’s take an attribute connect with each point, the total number of units per housing development, and visualize as a graduated symbology. Points with more units will be bigger, and not all places are weighted the same visually.
We use the “style” parameter to add a standard deviation data classification break. Check out tmap documentation for more options, like quantiles, natural jenks, or other options.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_bubbles()`: instead of `style = "sd"`, use fill.scale = `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style' to 'tm_scale_intervals(<HERE>)'
Then, let’s overlay that layer to the zip code boundary.
tm_shape(Chi_Zipsf.3435) + tm_polygons(col = "gray80") +
tm_shape(housing.3435) + tm_bubbles("Units", col = "purple")
You can also color-code according to the total number of units. Here, we’ll add a palette using a “viridis” color scheme, as a graduate color point map. For extra style, we’ll add labels to each zip code, update with a new basemap, and make the whole map interactive.
## ℹ tmap mode set to "view".
#Add Basemap
tm_basemap("Esri.WorldGrayCanvas") +
#Add First Layer, Style
tm_shape(Chi_Zipsf.3435) + tm_borders(col = "gray10") +
tm_text("zip", size = 0.7) +
#Add Second Layer, Style
tm_shape(housing.3435) +
tm_bubbles( col = "Units", style = "quantile",
pal = "viridis", size = 0.1)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_bubbles()`: instead of `style = "quantile"`, use fill.scale = `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style' to 'tm_scale_intervals(<HERE>)'[tm_bubbles()] Argument `pal` unknown.
Example 2. COVID-19 & Methadone
In the first example, let create a map showing both COVID-19 and methadone clinic data (used in A.3). First, let’s add our zip code map.
With this overlay, we’ll add a “hack” to include the methadone clinic points in a legend.
## ℹ tmap mode set to "plot".
##Add and style First Layer
tm_shape(Chi_Zipsf) + tm_polygons("Case.Rate...Cumulative",
style="jenks", pal="BuPu", n=4, title = "COVID Rt") +
##Add Second Layer
tm_shape(methadoneSf) + tm_dots(size = 0.2, col = "gray20") +
##"Hack" a manual symbology for dots in the legend
tm_add_legend("symbol", col = "gray20", size = .2, labels = "Methadone MOUD") +
##Final Cartographic Styling
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "jenks"`, use fill.scale = `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n' to 'tm_scale_intervals(<HERE>)'[v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the visual variable `fill` namely 'title' to
## 'fill.legend = tm_legend(<HERE>)'[tm_polygons()] Argument `pal` unknown.
A.8 Video Walkthrough
Here is a walkthrough of the module in video format. This presentation is provided by the author(s) of the module for your thorough understanding, so grab a cup of coffee and follow along:
Resources
For a deeper dive on topics discussed in this chapter, please check out the following. If you have a resource to add, feel free to suggest one by submitting an issue to our toolkit repository.
For tips on using
tmap
, check out the online text, Elegant and informative maps with tmap by Tennekes and Nowosad.Try out more mapping with the ggplot2 library. The Maps chapter will give you a head start.
We highly recommend Chapters 3-5 as mandatory reading in this classic, Geocomputation with R by Lovelace, Nowosad, and Muenchow. Perfecting selections and filters in the Attribute Data Operations chapter will help you become a data wrangling master. Perfect distance metrics and essential GIS operations in subsequent chapters.
The Appendix in Gimond’s Intro to GIS online book has a super overview of R examples, not to be missed.
Another superb resource is Analyzing US Census Data by Kyle Walker, with some of our favorite examples of extracing & reshaping data directly from the Census using code. Highly recommended!