A Data Wrangling in R
In this part of our toolkit, we’ll repeat the tasks covered in Chapter 4 - Spatial Data Wrangling, this time using R code to handle our spatial data.
Getting Started
R is a great choice for getting started in data science because it was built for data analysis. It is more than a programming language: it is a whole ecosystem of tools and libraries designed to support the way data scientists think and work.
We assume a basic knowledge of R and coding languages for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, know how to find the path to a folder on your computer, and have very basic familiarity with R.
Tutorials for R
If you are new to R, we recommend the following intro-level tutorials provided through installation guides. You can also refer to this R for Social Scientists tutorial developed by Data Carpentry for a refresher.
You can also visit the RStudio Education page to select a learning path tailored to your experience level (Beginners, Intermediates, Experts). They offer detailed instructions to learners at different stages of their R journey.
A.1 Environmental Setup
Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.
Download the Activity Datasets
Please download and unzip this file to get started: SDOHPlace-DataWrangling.zip
Setting Up the Working Directory
Setting up a working directory in R is crucial because it defines the location on your computer where your files and scripts are saved and accessed. You can set the working directory to any folder where you plan to store your datasets and R scripts. To set it, call setwd("/path/to/your/directory"), replacing the placeholder with the path to your desired directory.
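As a minimal sketch of that step, the snippet below checks the current working directory and then switches to a new one. The path is the placeholder from the text, not a real location, and the dir.exists() guard is an addition for illustration so that a mistyped path produces a clear message.

```r
# Where is R currently looking for files?
getwd()

# Placeholder path from the text; replace it with a real folder on your system.
data_dir <- "/path/to/your/directory"

# Only switch if the folder exists, so a typo gives a clear message
if (dir.exists(data_dir)) {
  setwd(data_dir)
} else {
  message("Folder not found: ", data_dir)
}

# Files visible from the working directory
list.files()
```

Once the working directory is set, files can be referred to by paths relative to that folder.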
Installing & Working with R Libraries
Before starting operations related to spatial data, we need to complete an environmental setup. This workshop requires several packages, which can be installed from CRAN:
sf: simplifies spatial data manipulation
tmap: streamlines thematic map creation
dplyr: facilitates data manipulation
tidygeocoder: converts addresses to coordinates easily
To install the packages, uncomment and run the installation code. You only need to install packages once in an R environment.
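One way to write the installation step is sketched here. The package names come from the list above; the check for packages that are already installed is an addition for illustration, not necessarily how the original workshop code is written.

```r
# Packages required for this workshop
pkgs <- c("sf", "tmap", "dplyr", "tidygeocoder")

# Which of them are not installed yet?
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
missing_pkgs

# Uncomment the next line to install from CRAN (needed only once per environment)
# install.packages(missing_pkgs)
```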
Installation Tip
For Mac users, check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the sf package. Using Homebrew to install gdal usually fixes any remaining issues.
Now load the required libraries for the next steps:
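A short sketch of the loading step, assuming all four packages listed above have been installed:

```r
# Load the packages used in this workshop
library(sf)            # spatial data manipulation
library(tmap)          # thematic mapping
library(dplyr)         # general data manipulation
library(tidygeocoder)  # address geocoding
```

Unlike installation, library() calls need to be run in every new R session.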
A.2 Intro to Spatial Data
Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.
Load Spatial Data
We need to load the spatial data (a shapefile). Remember, this type of data is actually made up of multiple files, and all of them need to be present for the data to be read correctly. Let’s use chicagotracts.shp for practice, which contains the census tract boundaries of Chicago.
First, read the shapefile from the location where you saved it.
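A minimal sketch of the read step, assuming the downloaded zip was extracted into a SDOHPlace-DataWrangling folder inside your working directory (adjust the path if you saved it elsewhere). The object name Chi_tracts matches the name used later in this section.

```r
library(sf)

# Relative path; assumes the unzipped folder sits in the working directory
Chi_tracts <- st_read("SDOHPlace-DataWrangling/chicagotracts.shp")
```

st_read() only needs the path to the .shp file, but the companion files (such as .dbf, .shx, and .prj) have to sit in the same folder.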
Your output will look something like:
## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS: WGS 84
Always inspect data after loading it. Let’s look at a non-spatial view of the first few rows.
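A sketch of this inspection step using head(), which prints the first six rows by default. The guard at the top re-reads the file only if Chi_tracts is not already in your session, and assumes the same folder location as before.

```r
library(sf)

# Read the tracts only if they are not already loaded (path is an assumption)
if (!exists("Chi_tracts")) {
  Chi_tracts <- st_read("SDOHPlace-DataWrangling/chicagotracts.shp", quiet = TRUE)
}

# Print the first six features with their attribute columns
head(Chi_tracts)
```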
## Simple feature collection with 6 features and 9 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -87.68822 ymin: 41.72902 xmax: -87.62394 ymax: 41.87455
## Geodetic CRS: WGS 84
## commarea commarea_n countyfp10 geoid10 name10 namelsad10 notes statefp10 tractce10 geometry
## 1 44 44 031 17031842400 8424 Census Tract 8424 <NA> 17 842400 POLYGON ((-87.62405 41.7302...
## 2 59 59 031 17031840300 8403 Census Tract 8403 <NA> 17 840300 POLYGON ((-87.68608 41.8229...
## 3 34 34 031 17031841100 8411 Census Tract 8411 <NA> 17 841100 POLYGON ((-87.62935 41.8528...
## 4 31 31 031 17031841200 8412 Census Tract 8412 <NA> 17 841200 POLYGON ((-87.68813 41.8556...
## 5 32 32 031 17031839000 8390 Census Tract 8390 <NA> 17 839000 POLYGON ((-87.63312 41.8744...
## 6 28 28 031 17031838200 8382 Census Tract 8382 <NA> 17 838200 POLYGON ((-87.66782 41.8741...
Check out the data structure of this file.
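The structure can be inspected with base R's str(), sketched here under the same assumption about where the shapefile is saved:

```r
library(sf)

# Read the tracts only if they are not already loaded (path is an assumption)
if (!exists("Chi_tracts")) {
  Chi_tracts <- st_read("SDOHPlace-DataWrangling/chicagotracts.shp", quiet = TRUE)
}

# Compact summary: object classes, column types, and the geometry column
str(Chi_tracts)
```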
## Classes 'sf' and 'data.frame': 801 obs. of 10 variables:
## $ commarea : chr "44" "59" "34" "31" ...
## $ commarea_n: num 44 59 34 31 32 28 65 53 76 77 ...
## $ countyfp10: chr "031" "031" "031" "031" ...
## $ geoid10 : chr "17031842400" "17031840300" "17031841100" "17031841200" ...
## $ name10 : chr "8424" "8403" "8411" "8412" ...
## $ namelsad10: chr "Census Tract 8424" "Census Tract 8403" "Census Tract 8411" "Census Tract 8412" ...
## $ notes : chr NA NA NA NA ...
## $ statefp10 : chr "17" "17" "17" "17" ...
## $ tractce10 : chr "842400" "840300" "841100" "841200" ...
## $ geometry :sfc_POLYGON of length 801; first list element: List of 1
## ..$ : num [1:243, 1:2] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr [1:9] "commarea" "commarea_n" "countyfp10" "geoid10" ...
The data is no longer a shapefile but an sf object made up of polygons. The plot() command in R quickly visualizes the geometric shapes of Chicago’s census tracts. The output includes multiple maps because the sf framework draws a preview of each attribute in our spatial file.
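A sketch of the plotting step, again assuming the shapefile location used earlier. The second call is an optional addition that draws only the tract outlines by extracting the geometry with st_geometry().

```r
library(sf)

# Read the tracts only if they are not already loaded (path is an assumption)
if (!exists("Chi_tracts")) {
  Chi_tracts <- st_read("SDOHPlace-DataWrangling/chicagotracts.shp", quiet = TRUE)
}

# One small map per attribute column
plot(Chi_tracts)

# Optional: a single map showing only the tract boundaries
plot(st_geometry(Chi_tracts))
```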
A.2.1 Adding a Basemap
Then, we can use tmap, a mapping library, in interactive mode to add a basemap layer. It plots the spatial data from Chi_tracts, applies a minimal theme for clarity, and labels the map with a title, offering a straightforward visualization of the area’s census tracts.
We stylize the tract boundaries by making their borders 50% transparent (which is equal to an alpha level of 0.5).
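The map described above might be built roughly as follows. This is a sketch rather than the workshop's exact code: the title text is a placeholder, the file path is assumed as before, and the arguments follow tmap version 3 (tmap version 4 renames them to col_alpha and tm_title()).

```r
library(sf)
library(tmap)

# Read the tracts only if they are not already loaded (path is an assumption)
if (!exists("Chi_tracts")) {
  Chi_tracts <- st_read("SDOHPlace-DataWrangling/chicagotracts.shp", quiet = TRUE)
}

# Interactive viewing mode draws the data on top of a basemap
tmap_mode("view")

# Tract borders at 50% transparency; the title is a placeholder
chi_map <- tm_shape(Chi_tracts) +
  tm_borders(alpha = 0.5) +
  tm_layout(title = "Census Tract Map of Chicago")

chi_map
```

Calling tmap_mode("plot") switches back to static maps.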
## tmap mode set to interactive viewing