A Data Wrangling in R
In this part of our toolkit, we’re going to learn how to do the same things we did in Chapter 4 - Spatial Data Wrangling, but this time we’ll use R code to handle our spatial data.
Getting Started
R is a great choice for getting started in data science because it was built for it. It’s not just a programming language; it’s a whole system with tools and libraries designed to help you think and work like a data scientist.
We assume a basic knowledge of R and coding languages for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, know how to find the address to a folder on your computer system, and have very basic familiarity with R.
Tutorials for R
If you are new to R, we recommend the following intro-level tutorials provided through installation guides. You can also refer to this R for Social Scientists tutorial developed by Data Carpentry for a refresher.
You can also visit the RStudio Education page to select a learning path tailored to your experience level (Beginners, Intermediates, Experts). They offer detailed instructions to learners at different stages of their R journey.
A.1 Environmental Setup
Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.
Download the Activity Datasets
Please download and unzip this file to get started: SDOHPlace-DataWrangling.zip
Setting Up the Working Directory
Setting up a working directory in R is crucial, as it defines the location on your computer where your files and scripts will be saved and accessed. You can set the working directory to any folder on your system where you plan to store your datasets and R scripts. To set your working directory, use the setwd() function and specify the path to your desired directory, e.g. setwd("/path/to/your/directory").
Installing & Working with R Libraries
Before starting operations related to spatial data, we need to complete an environmental setup. This workshop requires several packages, which can be installed from CRAN:
- sf: simplifies spatial data manipulation
- tmap: streamlines thematic map creation
- dplyr: facilitates data manipulation
- tidygeocoder: converts addresses to coordinates easily
Uncomment to install the packages with the code snippet below. You only need to install packages once in an R environment.
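A minimal sketch of that installation snippet (kept commented out, since installation is a one-time step per environment):

```r
# Uncomment to install; you only need to do this once in an R environment.
# install.packages("sf")
# install.packages("tmap")
# install.packages("dplyr")
# install.packages("tidygeocoder")
```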
Installation Tip
For Mac users, check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the sf package. Using Homebrew to install gdal usually fixes any remaining issues.
Now, loading the required libraries for further steps:
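A sketch of the library calls, assuming the four packages above are installed:

```r
library(sf)           # spatial data manipulation
library(tmap)         # thematic mapping
library(dplyr)        # data manipulation
library(tidygeocoder) # geocoding addresses
```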
A.2 Intro to Spatial Data
Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.
Load Spatial Data
We need to load the spatial data (shapefile). Remember, this type of data actually comprises multiple files, all of which need to be present in order to read correctly. Let’s use chicagotracts.shp for practice, which includes the census tract boundaries in Chicago.
First, we need to read the shapefile data from where you saved it.
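A sketch of the read step with st_read() from sf, assuming the zip file was extracted to a SDOHPlace-DataWrangling folder inside your working directory:

```r
# Read the census tract shapefile into an sf object
Chi_tracts <- st_read("SDOHPlace-DataWrangling/chicagotracts.shp")
```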
Your output will look something like:
## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS: WGS 84
Always inspect data when loading in. Let’s look at a non-spatial view.
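For example, head() shows the first few features as a regular attribute table (plus the geometry column):

```r
# First six features of the tract data
head(Chi_tracts)
```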
## Simple feature collection with 6 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.68822 ymin: 41.72902 xmax: -87.62394 ymax: 41.87455
## Geodetic CRS: WGS 84
## commarea commarea_n countyfp10 geoid10 name10 namelsad10 notes statefp10 tractce10 geometry
## 1 44 44 031 17031842400 8424 Census Tract 8424 <NA> 17 842400 POLYGON ((-87.62405 41.7302...
## 2 59 59 031 17031840300 8403 Census Tract 8403 <NA> 17 840300 POLYGON ((-87.68608 41.8229...
## 3 34 34 031 17031841100 8411 Census Tract 8411 <NA> 17 841100 POLYGON ((-87.62935 41.8528...
## 4 31 31 031 17031841200 8412 Census Tract 8412 <NA> 17 841200 POLYGON ((-87.68813 41.8556...
## 5 32 32 031 17031839000 8390 Census Tract 8390 <NA> 17 839000 POLYGON ((-87.63312 41.8744...
## 6 28 28 031 17031838200 8382 Census Tract 8382 <NA> 17 838200 POLYGON ((-87.66782 41.8741...
Check out the data structure of this file.
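One way to do this is with str():

```r
# Inspect the classes and columns of the sf object
str(Chi_tracts)
```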
## Classes 'sf' and 'data.frame': 801 obs. of 10 variables:
## $ commarea : chr "44" "59" "34" "31" ...
## $ commarea_n: num 44 59 34 31 32 28 65 53 76 77 ...
## $ countyfp10: chr "031" "031" "031" "031" ...
## $ geoid10 : chr "17031842400" "17031840300" "17031841100" "17031841200" ...
## $ name10 : chr "8424" "8403" "8411" "8412" ...
## $ namelsad10: chr "Census Tract 8424" "Census Tract 8403" "Census Tract 8411" "Census Tract 8412" ...
## $ notes : chr NA NA NA NA ...
## $ statefp10 : chr "17" "17" "17" "17" ...
## $ tractce10 : chr "842400" "840300" "841100" "841200" ...
## $ geometry :sfc_POLYGON of length 801; first list element: List of 1
## ..$ : num [1:243, 1:2] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr [1:9] "commarea" "commarea_n" "countyfp10" "geoid10" ...
The data is no longer a shapefile but an sf object, comprised of polygons. The plot() command in R helps to quickly visualize the geometric shapes of Chicago’s census tracts. The output includes multiple maps because the sf framework enables previews of each attribute in our spatial file.
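The quick preview described above can be generated with base plot():

```r
# Draws one small map per attribute column of the sf object
plot(Chi_tracts)
```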
A.2.1 Adding a Basemap
Then, we can use tmap, a mapping library, in interactive mode to add a basemap layer. It plots the spatial data from Chi_tracts, applies a minimal theme for clarity, and labels the map with a title, offering a straightforward visualization of the area’s census tracts.

We stylize the tract boundaries by making the borders 50% transparent (which is equal to an alpha level of 0.5).
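A sketch of that interactive map, using tmap’s default basemap:

```r
tmap_mode("view")  # switch tmap to interactive mode

tm_shape(Chi_tracts) + tm_borders(alpha = 0.5) +
  tm_layout(title = "Census Tract Map of Chicago")
```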
## tmap mode set to interactive viewing
Still in the interactive mode (“view”), we can switch to a different basemap. Here we bring in a “Voyager” style map from Carto, a cartographic firm. We’ll make the borders less transparent by adjusting the alpha level.
## tmap mode set to interactive viewing
tm_basemap("CartoDB.Voyager") +
tm_shape(Chi_tracts) + tm_borders(alpha=0.8, col = "gray40") +
tm_layout(title = "Census Tract Map of Chicago")
Tip
For additional options, you can preview basemaps at the Leaflet Providers Demo. Some basemaps we recommend that work consistently are by:
- CartoDB (Carto)
- Open Street Map
- ESRI
Not all basemaps are available anymore, and some require keys that you’d need to add on your own.
A.3 Coordinate Reference Systems
For this exercise we will use chicagotracts.shp to explore how to change the projection of a spatial dataset in R. First, let’s check out the current coordinate reference system.
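The current CRS can be inspected with st_crs():

```r
# Print the coordinate reference system of the tract data
st_crs(Chi_tracts)
```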
## Coordinate Reference System:
## User input: WGS 84
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]
We can use the st_transform function to transform the CRS. When projecting a dataset of all of Illinois, an appropriate NAD83 projection would be NAD83 UTM zone 16N; Chicago itself sits within the area best covered by NAD83 / Illinois East (ftUS) (EPSG:3435).
Chi_tracts.3435 <- st_transform(Chi_tracts, "EPSG:3435")
# Chi_tracts.3435 <- st_transform(Chi_tracts, 3435)
st_crs(Chi_tracts.3435)
## Coordinate Reference System:
## User input: EPSG:3435
## wkt:
## PROJCRS["NAD83 / Illinois East (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["SPCS83 Illinois East zone (US survey foot)",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",36.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-88.3333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.999975,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",984250,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - Illinois - counties of Boone; Champaign; Clark; Clay; Coles; Cook; Crawford; Cumberland; De Kalb; De Witt; Douglas; Du Page; Edgar; Edwards; Effingham; Fayette; Ford; Franklin; Gallatin; Grundy; Hamilton; Hardin; Iroquois; Jasper; Jefferson; Johnson; Kane; Kankakee; Kendall; La Salle; Lake; Lawrence; Livingston; Macon; Marion; Massac; McHenry; McLean; Moultrie; Piatt; Pope; Richland; Saline; Shelby; Vermilion; Wabash; Wayne; White; Will; Williamson."],
## BBOX[37.06,-89.27,42.5,-87.02]],
## ID["EPSG",3435]]
After changing the projection, we can plot the map. We’ll switch to the static version of tmap, using the plot mode.
## tmap mode set to plotting
tm_shape(Chi_tracts.3435) + tm_borders(alpha=0.5) +
tm_layout(main.title ="EPSG:3435 (ft)",
main.title.position = "center")
What if we had used the wrong EPSG code, referencing the wrong projection? Here we’ll transform and plot with EPSG code 3561, a coordinate reference system in Hawaii.
Chi_tracts.3561 <- st_transform(Chi_tracts, "EPSG:3561")
tm_shape(Chi_tracts.3561) + tm_borders(alpha=0.5) +
tm_layout(main.title ="EPSG:3561 (ft)",
main.title.position = "center")
It’s obviously not correct – the wrong CRS can cause a load of trouble. Make sure you specify carefully!
Refine Basic Map
Let’s take a deeper look at the cartographic mapping package, tmap. We approach mapping one layer at a time. Always start with the object you want to map by calling it with the tm_shape function. Then, at least one descriptive/styling function follows. There are hundreds of variations and parameter specifications.
Here we style the tracts with some semi-transparent borders.
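A minimal version of that first layered map:

```r
tmap_mode("plot")  # static mode

# One layer: tracts with semi-transparent borders
tm_shape(Chi_tracts) + tm_borders(alpha = 0.5)
```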
Next we fill the tracts with a light gray, and adjust the color and transparency of the borders. We also add a scale bar, positioning it to the left with a line width of 0.8, and turn off the frame.
tm_shape(Chi_tracts) + tm_fill(col = "gray90") + tm_borders(alpha=0.2, col = "gray10") +
tm_scale_bar(position = ("left"), lwd = 0.8) +
tm_layout(frame = F)
Check out https://rdrr.io/cran/tmap/man/tm_polygons.html for more ideas.
A.4 Converting to Spatial Data
A.4.1 Convert CSVs to Spatial Data
We are using the Affordable_Rental_Housing_Developments.csv file in the dataset to show how to convert CSV Lat/Long data to points. First, we need to load the CSV data.
Then, we need to ensure that no column (intended to be used as a coordinate) is entirely empty or filled with NA values.
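A sketch of both steps, assuming the coordinate columns are named “Longitude” and “Latitude” (as in the preview below):

```r
housing <- read.csv("SDOHPlace-DataWrangling/Affordable_Rental_Housing_Developments.csv")

# Drop any rows with missing coordinates; they cannot become points
housing <- housing[!is.na(housing$Longitude) & !is.na(housing$Latitude), ]
```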
Inspect the data to confirm it’s doing what you expect it to be doing. What columns will you use to specify the coordinates? In this dataset, we have multiple coordinate options. We’ll use latitude and longitude, or rather, longitude as our X value, and latitude as our Y value. In the data, it’s specified as “Longitude” and “Latitude.”
## Community.Area.Name Community.Area.Number Property.Type Property.Name Address Zip.Code Phone.Number
## 2 Rogers Park 1 Senior Morse Senior Apts. 6928 N. Wayne Ave. 60626 312-602-6207
## 3 Uptown 3 ARO The Draper 5050 N. Broadway 60640 312-818-1722
## 4 Edgewater 77 Senior Pomeroy Apts. 5650 N. Kenmore Ave. 60660 773-275-7820
## 5 Roseland 49 Supportive Housing Wentworth Commons 11045 S. Wentworth Ave. 60628 773-568-7804
## 6 Humboldt Park 23 Multifamily Nelson Mandela Apts. 607 N. Sawyer Ave. 60624 773-227-6332
## 7 Grand Boulevard 38 Multifamily Legends South - Gwendolyn Place 4333 S. Michigan Ave. 60653 773-624-7676
## Management.Company Units X.Coordinate Y.Coordinate Latitude Longitude Location
## 2 Morse Urban Dev. 44 1165844 1946059 42.00757 -87.66517 (42.0075737709331, -87.6651711448293)
## 3 Flats LLC 35 1167357 1933882 41.97413 -87.65996 (41.9741295261027, -87.6599553011627)
## 4 Habitat Company 198 1168181 1937918 41.98519 -87.65681 (41.9851867755403, -87.656808676983)
## 5 Mercy Housing Lakefront 50 1176951 1831516 41.69302 -87.62777 (41.6930159120977, -87.6277673462214)
## 6 Bickerdike Apts. 6 1154640 1903912 41.89215 -87.70753 (41.8921534052465, -87.7075265659001)
## 7 Interstate Realty Management Co. 71 1177924 1876178 41.81555 -87.62286 (41.815550396096, -87.6228565224104)
Finally, we start converting it to points. Be sure you use the CRS of the original coordinates as recorded. In this case we weren’t sure what CRS that was, so we use EPSG:4326 to test.
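A sketch of the conversion with st_as_sf(), assuming the CSV has been read into a data frame named housing (a name used here for illustration):

```r
# Convert the data frame to an sf point object; X = longitude, Y = latitude
points_housing <- st_as_sf(housing,
                           coords = c("Longitude", "Latitude"),
                           crs = 4326)
```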
View the resulting sf object with a basemap to confirm they are in the right place. Overlay them on top of the tract data, to confirm they are plotting correctly.
### First Layer
tm_shape(Chi_tracts) + tm_borders(lwd = 0.5) +
### Second Layer
tm_shape(points_housing) + tm_dots(size = 0.1 )
You can change the tmap_mode to “view” to add a basemap in an interactive setting, and then switch back to “plot” when complete. Because we’re plotting dots using tmap, we’ll use the tm_dots function for styling.
## tmap mode set to interactive viewing
## tmap mode set to plotting
We’ll reproject to EPSG:3435, our system standard.
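For example, with st_transform():

```r
# Reproject the housing points to NAD83 / Illinois East (ftUS)
housing.3435 <- st_transform(points_housing, "EPSG:3435")
```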
A.4.1.1 Write Spatial Data
Finally, we can save our points as a spatial dataset. Use st_write to write your spatial object in R to a data format of your choice. Here, we’ll write to a GeoJSON file. Uncomment to run this line.
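A sketch of the write step (commented out, so it only runs when you choose to):

```r
# Write the points to GeoJSON; the filename here is illustrative
# st_write(points_housing, "housing.geojson")
```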
You could also save as a shapefile (e.g., “housing.shp”) to get a shapefile format; however, you’ll get an error noting that some column names are too long and must be shortened. The shapefile format has a limit of 10 characters for field names.
The file may still write, but the column names that were too long may be shortened automatically.
To change column or field names in R objects, there are dozens of options. Try searching and “googling” different search terms to identify solutions on your own.
A.4.2 Geocode Addresses
Here, we will use chicago_methadone_nogeometry.csv for practice, which includes methadone centers in Chicago (center names and addresses). First we load the tidygeocoder library to get our geocoding done.
Let’s read in and inspect the data for methadone maintenance providers. Note, these addresses were made available by SAMHSA and are known as publicly available information. An additional analysis could call each service to check on access to medication during COVID in September 2020, and the list would be updated further.
methadoneClinics <- read.csv("SDOHPlace-DataWrangling/chicago_methadone_nogeometry.csv")
head(methadoneClinics)
## X Name Address City State Zip
## 1 1 Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago IL 60640
## 2 2 Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago IL 60640
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview 3934 N. Lincoln Ave. Chicago IL 60613
## 4 4 PDSSC - Chicago, Inc. 2260 N. Elston Ave. Chicago IL 60614
## 5 5 Center for Addictive Problems, Inc. 609 N. Wells St. Chicago IL 60654
## 6 6 Family Guidance Centers, Inc. 310 W. Chicago Ave. Chicago IL 60654
Let’s geocode one address first, just to make sure our system is working. We’ll use the “cascade” method, which uses the US Census and OpenStreetMap geocoders. These two services are the main options with tidygeocoder.
sample <- geo("2260 N. Elston Ave. Chicago, IL", lat = latitude, long = longitude, method = 'cascade')
## Warning: The `method` argument of `geo()` cannot be "cascade" as of tidygeocoder 1.0.4.
## ℹ Please use `geocode_combine()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Passing 1 address to the US Census single address geocoder
## Query completed in: 0.7 seconds
## # A tibble: 1 × 4
## address latitude longitude geo_method
## <chr> <dbl> <dbl> <chr>
## 1 2260 N. Elston Ave. Chicago, IL 41.9 -87.7 census
As we prepare for geocoding, check out the structure of the dataset. The address fields should be character type to be read properly.
## 'data.frame': 27 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Chicago Treatment and Counseling Center, Inc." "Sundace Methadone Treatment Center, LLC" "Soft Landing Interventions/DBA Symetria Recovery of Lakeview" "PDSSC - Chicago, Inc." ...
## $ Address: chr "4453 North Broadway st." "4545 North Broadway St." "3934 N. Lincoln Ave." "2260 N. Elston Ave." ...
## $ City : chr "Chicago" "Chicago" "Chicago" "Chicago" ...
## $ State : chr "IL" "IL" "IL" "IL" ...
## $ Zip : int 60640 60640 60613 60614 60654 60654 60651 60607 60607 60616 ...
We need to clean the data a bit. We’ll add a new column for a full address, as required by the geocoding service. When you use a geocoding service, be sure to read the documentation and understand how the data needs to be formatted for input.
methadoneClinics$fullAdd <- paste(as.character(methadoneClinics$Address),
as.character(methadoneClinics$City),
as.character(methadoneClinics$State),
as.character(methadoneClinics$Zip))
We’re ready to go! Batch geocode with one function, and inspect:
geoCodedClinics <- geocode(methadoneClinics,
address = 'fullAdd', lat = latitude, long = longitude, method = 'cascade')
## Passing 27 addresses to the US Census batch geocoder
## Query completed in: 0.5 seconds
## # A tibble: 6 × 10
## X Name Address City State Zip fullAdd latitude longitude geo_method
## <int> <chr> <chr> <chr> <chr> <int> <chr> <dbl> <dbl> <chr>
## 1 1 Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago IL 60640 4453 Nort… 42.0 -87.7 census
## 2 2 Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago IL 60640 4545 Nort… 42.0 -87.7 census
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview 3934 N. Lincoln Ave. Chicago IL 60613 3934 N. L… 42.0 -87.7 census
## 4 4 PDSSC - Chicago, Inc. 2260 N. Elston Ave. Chicago IL 60614 2260 N. E… 41.9 -87.7 census
## 5 5 Center for Addictive Problems, Inc. 609 N. Wells St. Chicago IL 60654 609 N. We… 41.9 -87.6 census
## 6 6 Family Guidance Centers, Inc. 310 W. Chicago Ave. Chicago IL 60654 310 W. Ch… 41.9 -87.6 census
There were two that didn’t geocode correctly, which you can inspect further. This could involve a quick check for spelling issues, or searching the address, pulling the lat/long from Google Maps, and inputting it manually. Or, if we are concerned it’s a human or unknown error, we could omit it. For this exercise we’ll just omit the two clinics that didn’t geocode correctly.
A.5 Convert to Spatial Data
This is not spatial data yet! To convert a static file to spatial data, we use the powerful st_as_sf function from sf. Indicate the x,y parameters (longitude, latitude) and the coordinate reference system used. Our geocoding service used the standard EPSG:4326, so we input that here.
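A sketch of the conversion. The filtered object name is an assumption; tidygeocoder names its output columns latitude and longitude, as shown earlier:

```r
# Drop the clinics that failed to geocode (no coordinates returned)
clinics_ok <- geoCodedClinics[!is.na(geoCodedClinics$latitude), ]

# Convert to an sf point object in EPSG:4326
methadoneSf <- st_as_sf(clinics_ok,
                        coords = c("longitude", "latitude"),
                        crs = 4326)
```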
Basic Map of Points
For a really simple map of points – to ensure they were geocoded and converted to spatial data correctly – we use tmap. We’ll use the interactive version to view.
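For example:

```r
tmap_mode("view")  # interactive mode with a default basemap

tm_shape(methadoneSf) + tm_dots(size = 0.1, col = "blue")
```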
If your points didn’t plot correctly:
- Did you flip the longitude/latitude values?
- Did you input the correct CRS?
Those two issues are the most common errors.
A.6 Merge Data sets
Reshape Data
Here, we use the COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv dataset to practice converting long data to a wide format.
We subset to the first two columns and the sixth column. That gives us the zip code, the reporting week, and cumulative cases of COVID-19. We want each zip code to be a unique row, with cases by week as columns. Choose whatever subset function you prefer!
covid = read.csv("SDOHPlace-DataWrangling/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")
covid_clean = covid[,c(1:2, 6)]
head(covid_clean)
## ZIP.Code Week.Number Cases...Cumulative
## 1 60603 39 13
## 2 60604 39 31
## 3 60611 16 72
## 4 60611 15 64
## 5 60615 11 NA
## 6 60603 10 NA
Now, we are trying to create a wide data set with the cumulative cases for each week for each zip code. Enter the code and you will see the new wide data format.
covid_wide <- reshape(covid_clean, direction = "wide",
idvar = "ZIP.Code", timevar = "Week.Number")
head(covid_wide)
## ZIP.Code Cases...Cumulative.39 Cases...Cumulative.16 Cases...Cumulative.15 Cases...Cumulative.11 Cases...Cumulative.10 Cases...Cumulative.12
## 1 60603 13 NA NA NA NA NA
## 2 60604 31 NA NA NA NA NA
## 3 60611 458 72 64 NA NA 16
## 5 60615 644 171 132 NA NA 26
## 31 60605 391 93 65 NA NA 23
## 32 Unknown 240 23 18 NA NA NA
## Cases...Cumulative.13 Cases...Cumulative.14 Cases...Cumulative.34 Cases...Cumulative.17 Cases...Cumulative.18 Cases...Cumulative.19
## 1 NA NA 11 NA NA NA
## 2 NA NA 29 6 11 14
## 3 41 57 352 80 92 99
## 5 57 99 567 215 243 274
## 31 39 52 325 118 135 149
## 32 NA 9 162 33 53 62
## Cases...Cumulative.20 Cases...Cumulative.31 Cases...Cumulative.22 Cases...Cumulative.23 Cases...Cumulative.24 Cases...Cumulative.25
## 1 5 9 6 6 6 6
## 2 17 25 22 23 24 24
## 3 114 286 139 148 152 163
## 5 302 526 353 364 376 388
## 31 155 291 187 194 198 215
## 32 69 127 97 102 106 112
## Cases...Cumulative.28 Cases...Cumulative.29 Cases...Cumulative.30 Cases...Cumulative.32 Cases...Cumulative.33 Cases...Cumulative.26
## 1 6 8 9 10 11 6
## 2 25 25 25 25 25 25
## 3 223 240 264 305 333 175
## 5 444 482 506 539 556 401
## 31 247 263 277 304 312 229
## 32 120 124 126 132 142 113
## Cases...Cumulative.27 Cases...Cumulative.36 Cases...Cumulative.38 Cases...Cumulative.21 Cases...Cumulative.35 Cases...Cumulative.37
## 1 6 11 13 6 11 13
## 2 25 31 31 20 30 31
## 3 196 391 435 124 371 411
## 5 418 606 629 332 588 613
## 31 235 354 379 169 333 364
## 32 116 186 216 92 178 201
## Cases...Cumulative.40
## 1 14
## 2 31
## 3 478
## 5 651
## 31 399
## 32 288
Join by Attribute
Here, we’ll merge datasets with a common variable in R. Merging the cumulative case dataset you created in the last section to the zip code spatial data (ChiZipMaster1.geojson) will allow you to map the case data. You’ll merge the case data and spatial data using the zip code field of each dataset.
We’ve cleaned our COVID case data already, but not all values under the zip code column are valid. There is a row with a value of “Unknown,” so let’s remove it.
Then, we need to load the zipcode data.
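A sketch of both steps, removing the “Unknown” row from the wide case data and then reading the GeoJSON:

```r
# Keep only rows with a valid zip code
covid_wide <- covid_wide[covid_wide$ZIP.Code != "Unknown", ]

# Read the zip code boundaries
zipcode <- st_read("SDOHPlace-DataWrangling/ChiZipMaster1.geojson")
```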
Your output will look something like:
## Reading layer `ChiZipMaster1' from data source `./SDOHPlace-DataWrangling/ChiZipMaster1.geojson' using driver `GeoJSON'
## Simple feature collection with 540 features and 31 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.87596 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS: WGS 84
You’ll notice that the zip codes are repeated in the zip code dataset, which needs to be cleaned before we can continue with merging the data.
# Option 1: drop fully duplicated rows
zipcode_unique <- distinct(zipcode)

# Option 2: keep the first feature for each zip code
zipcode_unique <- zipcode %>%
  group_by(zip) %>%
  slice(1) %>%
  ungroup()
Now the two datasets are ready to join together by zip code. Make sure to check that they have been joined successfully.
We’ll call the joined dataset Chi_Zipsf, to denote a final zip code master dataset.
We’ll reproject to EPSG:3435, the standard used in our study area.
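A sketch of the attribute join and reprojection, assuming the zip field is named zip in the spatial data and ZIP.Code in the case data (as shown above):

```r
# Join COVID case data to the deduplicated zip code geometries
Chi_Zipsf <- merge(zipcode_unique, covid_wide,
                   by.x = "zip", by.y = "ZIP.Code")

# Reproject to the study area standard
Chi_Zipsf.3435 <- st_transform(Chi_Zipsf, "EPSG:3435")
```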
Join by Location
We’ll create a spatial join with the housing and zip code data we’ve brought in.
In this example, we want to join zip-level data to the Rental Housing Developments, so we can identify which zips they are within.
First, let’s try “sticking” all the zip code data to the housing points, intersecting zip codes with housing developments.
To do this, both datasets will need to be in the same CRS. We have already standardized both using EPSG:3435.
Don’t forget to inspect the data. Uncomment to explore!
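A sketch of the point-in-polygon join with st_join(); the output name housing_zips is hypothetical:

```r
# Attach the attributes of the zip code each point falls within
housing_zips <- st_join(housing.3435, Chi_Zipsf.3435)

# Uncomment to explore:
# head(housing_zips)
```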
We could also flip things around and count how many developments intersect each zip code. We can use lengths() to find out how many items are present in each element of a list. Here:
Chi_Zipsf.3435$TotHousing <- lengths(st_intersects(Chi_Zipsf.3435, housing.3435))
head(Chi_Zipsf.3435)
## Simple feature collection with 6 features and 63 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1173037 ymin: 1889918 xmax: 1183259 ymax: 1902960
## Projected CRS: NAD83 / Illinois East (ftUS)
## # A tibble: 6 × 64
## zip objectid shape_area shape_len Case.Rate...Cumulative year totPopE whiteP blackP amIndP asianP pacIsP otherP hispP noHSP age0_4 age5_14 age15_19
## <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 60601 27 9166246. 19805. 1451. 2018 14675 74.2 5.57 0.45 18 0 1.81 8.68 0 550 156 907
## 2 60602 26 4847125. 14448. 1688. 2018 1244 68.2 3.78 5.31 19.4 0 3.3 6.51 0 61 87 18
## 3 60603 19 4560229. 13673. 1107. 2018 1174 63.5 3.24 0 27.6 0 5.71 9.8 0 13 43 179
## 4 60604 48 4294902. 12246. 3964. 2018 782 63.4 5.63 0 29.7 0 1.28 4.35 0 12 7 52
## 5 60605 20 36301276. 37973. 1421. 2018 27519 61.2 17.2 0.18 16.1 0.03 5.31 5.84 2.39 837 1279 2172
## 6 60606 31 6766411. 12040. 2290. 2018 3101 72.8 2.35 0 18.1 0 6.8 6.29 0.73 57 44 0
## # ℹ 46 more variables: age20_24 <int>, age15_44 <int>, age45_49 <int>, age50_54 <int>, age55_59 <int>, age60_64 <int>, ageOv65 <int>, ageOv18 <int>,
## # age18_64 <int>, a15_24P <dbl>, und45P <dbl>, ovr65P <dbl>, disbP <dbl>, geometry <MULTIPOLYGON [US_survey_foot]>, Cases...Cumulative.39 <int>,
## # Cases...Cumulative.16 <int>, Cases...Cumulative.15 <int>, Cases...Cumulative.11 <int>, Cases...Cumulative.10 <int>, Cases...Cumulative.12 <int>,
## # Cases...Cumulative.13 <int>, Cases...Cumulative.14 <int>, Cases...Cumulative.34 <int>, Cases...Cumulative.17 <int>, Cases...Cumulative.18 <int>,
## # Cases...Cumulative.19 <int>, Cases...Cumulative.20 <int>, Cases...Cumulative.31 <int>, Cases...Cumulative.22 <int>, Cases...Cumulative.23 <int>,
## # Cases...Cumulative.24 <int>, Cases...Cumulative.25 <int>, Cases...Cumulative.28 <int>, Cases...Cumulative.29 <int>, Cases...Cumulative.30 <int>,
## # Cases...Cumulative.32 <int>, Cases...Cumulative.33 <int>, Cases...Cumulative.26 <int>, Cases...Cumulative.27 <int>, Cases...Cumulative.36 <int>, …
A.7 Inspect Data
A.7.1 Thematic Maps
To inspect data from a spatial perspective, we can create a series of choropleth maps.
Example 1. Number of Affordable Housing Developments per Zip Code
Choose the variable “TotHousing” to map total developments per zip code, as we calculated previously. Here we’ll map using Jenks data breaks, with a Blue to Purple palette and four bins (or groups of data to be plotted). A histogram of the data is also plotted, visualizing where breaks occurred in the data to generate the map.
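A sketch of that choropleth, assuming the joined zip-level object from the previous section:

```r
tm_shape(Chi_Zipsf.3435) +
  tm_polygons("TotHousing", style = "jenks", pal = "BuPu", n = 4,
              legend.hist = TRUE, title = "Housing Developments") +
  tm_layout(legend.outside = TRUE)
```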
A.7.2 Map Overlay
Example 1. Affordable Housing Developments & Zipcode Boundaries
We previously translated the housing dataset from a CSV to a spatial object. Let’s take an attribute connected with each point, the total number of units per housing development, and visualize it as a graduated symbology. Points with more units will be bigger, so not all places are weighted the same visually.
We use the “style” parameter to add a standard deviation data classification break. Check out the tmap documentation for more options, like quantiles, natural breaks (Jenks), and others.
Then, let’s overlay that layer to the zipcode boundary.
tm_shape(Chi_Zipsf.3435) + tm_polygons(col = "gray80") +
tm_shape(housing.3435) + tm_bubbles("Units", col = "purple")
You can also color-code according to the total number of units. Here, we’ll add a palette using a “viridis” color scheme, as a graduated color point map. For extra style, we’ll add labels to each zip code, update with a new basemap, and make the whole map interactive.
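One possible version of that interactive map; the parameter choices here are illustrative:

```r
tmap_mode("view")

tm_basemap("CartoDB.Voyager") +
  tm_shape(Chi_Zipsf.3435) + tm_borders(col = "gray40") +
  tm_text("zip", size = 0.7) +          # label each zip code
  tm_shape(housing.3435) +
  tm_bubbles(size = "Units", col = "Units",
             style = "sd", pal = "viridis")
```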
## tmap mode set to interactive viewing
Example 2. COVID-19 & Methadone
In this first example, let’s create a map showing both the COVID-19 data and the methadone clinic data we geocoded earlier. First, let’s add our zipcode map.
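A sketch of the base zip code layer on its own, before the overlay:

```r
tm_shape(Chi_Zipsf) +
  tm_polygons("Case.Rate...Cumulative", style = "jenks",
              pal = "BuPu", n = 4, title = "COVID Rt")
```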
With this overlay, we’ll add a “hack” to include the methadone clinic points in a legend.
## tmap mode set to plotting
##Add and style First Layer
tm_shape(Chi_Zipsf) + tm_polygons("Case.Rate...Cumulative",
style="jenks", pal="BuPu", n=4, title = "COVID Rt") +
##Add Second Layer
tm_shape(methadoneSf) + tm_dots(size = 0.2, col = "gray20") +
##"Hack" a manual symbology for dots in the legend
tm_add_legend("symbol", col = "gray20", size = .2, labels = "Methadone MOUD") +
##Final Cartographic Styling
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
Resources
For a deeper dive on topics discussed in this chapter, please check out the following. If you have a resource to add, feel free to suggest one by submitting an issue to our toolkit repository.
- For tips on using tmap, check out the online text, Elegant and informative maps with tmap by Tennekes and Nowosad.
- Try out more mapping with the ggplot2 library. The Maps chapter will give you a head start.
- We highly recommend Chapters 3-5 as mandatory reading in this classic, Geocomputation with R by Lovelace, Nowosad, and Muenchow. Perfecting selections and filters in the Attribute Data Operations chapter will help you become a data wrangling master. Perfect distance metrics and essential GIS operations in subsequent chapters.
- The Appendix in Gimond’s Intro to GIS online book has a super overview of R examples, not to be missed.
- Another superb resource is Analyzing US Census Data by Kyle Walker, with some of our favorite examples of extracting & reshaping data directly from the Census using code. Highly recommended!