A Data Wrangling in R

In this part of the toolkit, we repeat the tasks from Chapter 4 (Spatial Data Wrangling), this time using R code to handle our spatial data.

Getting Started

R is a great choice for getting started in data science because it was built for data analysis. It is not just a programming language; it is a whole ecosystem of tools and libraries designed to help you think and work like a data scientist.

We assume a basic knowledge of R and of coding in general for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, know how to find the path to a folder on your computer, and have very basic familiarity with R.

Tutorials for R

If you are new to R, we recommend the following intro-level tutorials provided through installation guides. You can also refer to this R for Social Scientists tutorial developed by Data Carpentry for a refresher.

You can also visit the RStudio Education page to select a learning path tailored to your experience level (Beginners, Intermediates, Experts). They offer detailed instructions to learners at different stages of their R journey.

A.1 Environmental Setup

Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.

Download the Activity Datasets

Please download and unzip this file to get started: SDOHPlace-DataWrangling.zip

Setting Up the Working Directory

Setting up a working directory in R is crucial because it defines the location on your computer where files and scripts are read and saved by default. You can set the working directory to any folder where you plan to store your datasets and R scripts. To set it, call setwd("/path/to/your/directory"), replacing the placeholder with the path to your desired directory.
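As a minimal sketch of this step, the snippet below switches into a project folder, confirms the change with getwd(), and then switches back. The folder name "sdoh-workshop" and the use of tempdir() are placeholders chosen only so the example runs anywhere; in practice you would point project_dir at the folder where you unzipped the activity data.

```r
# Remember the starting directory so it can be restored afterwards.
original_dir <- getwd()

# Hypothetical project folder (an assumption for illustration);
# replace with a real path on your machine.
project_dir <- file.path(tempdir(), "sdoh-workshop")
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)

# Switch to the project folder and confirm where R is now looking.
setwd(project_dir)
current_dir <- getwd()
print(current_dir)

# List the files R can see with relative paths (empty for a new folder).
print(list.files())

# Return to the starting directory.
setwd(original_dir)
```

Once the working directory is set, relative paths such as "SDOHPlace-DataWrangling/chicagotracts.shp" are resolved against it.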

Installing & Working with R Libraries

Before starting operations related to spatial data, we need to complete an environmental setup. This workshop requires several packages, which can be installed from CRAN:

  • sf: simplifies spatial data manipulation
  • tmap: streamlines thematic map creation
  • dplyr: facilitates data manipulation
  • tidygeocoder: converts addresses to coordinates easily

Uncomment the line below to install the packages. You only need to install packages once in an R environment.

#install.packages(c("sf", "tmap", "tidygeocoder", "dplyr"))
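If you are unsure which of these packages are already present, one possible approach is to check first and install only what is missing. This is a sketch rather than a required step; the variable names are illustrative, and the install line is left commented out so nothing is downloaded unless you choose to.

```r
# Packages this workshop relies on.
required_pkgs <- c("sf", "tmap", "tidygeocoder", "dplyr")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are already installed.")
}
```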

Installation Tip

For Mac users, check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the sf package. Using Homebrew to install GDAL usually fixes any remaining issues.

Now load the libraries required for the following steps:

library(sf)
library(dplyr)
library(tmap)

A.2 Intro to Spatial Data

Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.

Load Spatial Data

We need to load the spatial data (a shapefile). Remember, this type of data is actually made up of multiple files, and all of them need to be present for the data to be read correctly. Let’s use chicagotracts.shp for practice, which contains the census tract boundaries in Chicago.
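Because a shapefile is spread across several files, a quick check before reading it can save some confusion. The sketch below looks for the three mandatory components (.shp, .shx, .dbf) plus the commonly included .prj file that stores the projection. It assumes the zip was extracted into the working directory under the folder name used in this tutorial; the actual contents of the archive are not verified here.

```r
# Path used in this tutorial, relative to the working directory.
shp_path <- "SDOHPlace-DataWrangling/chicagotracts.shp"

# Build the names of the companion files that share the same base name.
base_path <- tools::file_path_sans_ext(shp_path)
components <- paste0(base_path, c(".shp", ".shx", ".dbf", ".prj"))

# TRUE means the file was found; FALSE means it is missing or the path is wrong.
found <- file.exists(components)
names(found) <- basename(components)
print(found)
```

If any entry is FALSE, re-check the unzip location and the working directory before calling st_read().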

First, we need to read the shapefile from the location where you saved it.

Chi_tracts = st_read("SDOHPlace-DataWrangling/chicagotracts.shp")

Your output will look something like:

## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS:  WGS 84

Always inspect data right after loading it. Let’s start with a tabular view of the first few rows.

head(Chi_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.68822 ymin: 41.72902 xmax: -87.62394 ymax: 41.87455
## Geodetic CRS:  WGS 84
##   commarea commarea_n countyfp10     geoid10 name10        namelsad10 notes statefp10 tractce10                       geometry
## 1       44         44        031 17031842400   8424 Census Tract 8424  <NA>        17    842400 POLYGON ((-87.62405 41.7302...
## 2       59         59        031 17031840300   8403 Census Tract 8403  <NA>        17    840300 POLYGON ((-87.68608 41.8229...
## 3       34         34        031 17031841100   8411 Census Tract 8411  <NA>        17    841100 POLYGON ((-87.62935 41.8528...
## 4       31         31        031 17031841200   8412 Census Tract 8412  <NA>        17    841200 POLYGON ((-87.68813 41.8556...
## 5       32         32        031 17031839000   8390 Census Tract 8390  <NA>        17    839000 POLYGON ((-87.63312 41.8744...
## 6       28         28        031 17031838200   8382 Census Tract 8382  <NA>        17    838200 POLYGON ((-87.66782 41.8741...
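Note that head() on an sf object still prints the geometry column. For a strictly non-spatial view, sf provides st_drop_geometry(), while st_crs() and st_bbox() report the spatial metadata on their own. The sketch below uses a tiny made-up object (toy_tracts, with two square polygons and invented attribute values) as a stand-in so that it runs without the workshop files; with the tutorial data you would pass Chi_tracts instead.

```r
library(sf)

# Helper for illustration only: a unit square with its lower-left corner at (x0, y0).
square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Toy stand-in for Chi_tracts: two polygons with made-up attributes.
toy_tracts <- st_sf(
  name10 = c("A", "B"),
  commarea_n = c(44, 59),
  geometry = st_sfc(square(0, 0), square(1, 0), crs = 4326)
)

# Drop the geometry column to get a plain data.frame of attributes.
attrs <- st_drop_geometry(toy_tracts)
print(class(attrs))
print(names(attrs))

# Spatial metadata can be queried directly from the sf object.
print(st_crs(toy_tracts)$epsg)
print(st_bbox(toy_tracts))
```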

Check out the data structure of this file.

str(Chi_tracts)
## Classes 'sf' and 'data.frame':   801 obs. of  10 variables:
##  $ commarea  : chr  "44" "59" "34" "31" ...
##  $ commarea_n: num  44 59 34 31 32 28 65 53 76 77 ...
##  $ countyfp10: chr  "031" "031" "031" "031" ...
##  $ geoid10   : chr  "17031842400" "17031840300" "17031841100" "17031841200" ...
##  $ name10    : chr  "8424" "8403" "8411" "8412" ...
##  $ namelsad10: chr  "Census Tract 8424" "Census Tract 8403" "Census Tract 8411" "Census Tract 8412" ...
##  $ notes     : chr  NA NA NA NA ...
##  $ statefp10 : chr  "17" "17" "17" "17" ...
##  $ tractce10 : chr  "842400" "840300" "841100" "841200" ...
##  $ geometry  :sfc_POLYGON of length 801; first list element: List of 1
##   ..$ : num [1:243, 1:2] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:9] "commarea" "commarea_n" "countyfp10" "geoid10" ...

The data is no longer a shapefile but an sf object, made up of polygons. The plot() command in R helps to quickly visualize the geometric shapes of Chicago’s census tracts. The output includes multiple maps because the sf framework draws a preview of each attribute in our spatial file.

plot(Chi_tracts)
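When one map is enough, you can plot just the outlines or a single attribute. The sketch below shows both, using the North Carolina counties shapefile that ships with the sf package as a stand-in so that it runs without the workshop download. With the tutorial data, the equivalent calls would be plot(st_geometry(Chi_tracts)) and, for example, plot(Chi_tracts["commarea_n"]).

```r
library(sf)

# Stand-in data (an assumption for illustration): the demo shapefile bundled with sf.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Outlines only: st_geometry() extracts the geometry column,
# so no attribute is mapped to color.
nc_geom <- st_geometry(nc)
plot(nc_geom)

# One attribute only: single-bracket selection keeps the geometry
# alongside the chosen column, so plot() draws a single map.
nc_births <- nc["BIR74"]
plot(nc_births)
```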

A.2.1 Adding a Basemap

Next, we can use tmap, a mapping library, in interactive mode to add a basemap layer. The code below plots the spatial data from Chi_tracts and labels the map with a title, offering a straightforward visualization of the area’s census tracts.

We style the tract borders by making them 50% transparent (an alpha level of 0.5).

library(tmap)

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(Chi_tracts) + tm_borders(alpha=0.5) +
  tm_layout(title = "Census Tract Map of Chicago")
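Interactive mode stays active for the rest of the session until it is changed. If you later want a static map instead, for example for a report, tmap_mode("plot") switches back. The sketch below again uses the demo shapefile bundled with sf as a stand-in for Chi_tracts and keeps to a bare tm_borders() call; styling arguments are omitted on purpose because their names can differ between tmap versions.

```r
library(sf)
library(tmap)

# Stand-in data (an assumption for illustration): the demo shapefile bundled with sf.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# "plot" mode draws static maps; "view" mode draws interactive ones.
tmap_mode("plot")

# Build the map object first, then print it to draw it.
static_map <- tm_shape(nc) + tm_borders()
print(static_map)
```

Calling tmap_mode("view") again returns to interactive maps with a basemap.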