A Data Wrangling in R

In this part of our toolkit, we’ll repeat the tasks from Chapter 4 - Spatial Data Wrangling, but this time we’ll use R code to handle our spatial data.

Getting Started

R is a great choice for getting started in data science because it was built for it. It’s not just a programming language; it is a whole system of tools and libraries designed to help you think and work like a data scientist.

We assume a basic knowledge of R and coding languages for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, know how to find the path to a folder on your computer, and have very basic familiarity with R.

Tutorials for R

If you are new to R, we recommend the following intro-level tutorials provided through installation guides. You can also refer to this R for Social Scientists tutorial developed by Data Carpentry for a refresher.

You can also visit the RStudio Education page to select a learning path tailored to your experience level (Beginners, Intermediates, Experts). They offer detailed instructions to learners at different stages of their R journey.

A.1 Environment Setup

Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.

Download the Activity Datasets

Please download and unzip this file to get started: SDOHPlace-DataWrangling.zip

Setting Up the Working Directory

Setting up a working directory in R is crucial: any file paths you use in your code, such as the names of files to read data from, will be treated as relative to your current working directory. You should set the working directory to a folder on your system where you plan to store your datasets and R scripts.

To set your working directory, use

setwd("/path/to/your/directory")

To see what your working directory is currently set to, use getwd().
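To make the idea of relative paths concrete, here is a minimal sketch using only base R. The folder name RWork-demo and the file example.txt are made up for illustration, and the folder is created inside R's temporary directory so nothing on your system is changed.

```r
# Create a throwaway folder (hypothetical name) inside R's temp directory
work_dir <- file.path(tempdir(), "RWork-demo")
dir.create(work_dir, showWarnings = FALSE)

# setwd() invisibly returns the previous working directory, so we can restore it
old_dir <- setwd(work_dir)

# A relative path is resolved against the working directory,
# so this file is written inside work_dir
writeLines("hello", "example.txt")
file_found <- file.exists(file.path(work_dir, "example.txt"))
print(file_found)

# Restore the original working directory
setwd(old_dir)
```

Saving the value returned by setwd() is a convenient way to switch back once you are done.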

A.1.0.1 Example: Local directory on Windows

You can use forward slashes for directory paths:

setwd("C:/Users/susan/RWork")

or you can use double backslashes:

setwd("C:\\Users\\susan\\RWork")
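The doubled backslashes are needed because a single backslash starts an escape sequence inside an R string. The short base R sketch below, which reuses the example path from above, shows that each "\\" is stored as one character and that the path can be converted to the forward-slash form.

```r
win_path <- "C:\\Users\\susan\\RWork"   # example path from the text

# cat() prints the string as R stores it: one backslash per separator
cat(win_path, "\n")

# nchar() counts each "\\" as a single character
n_chars <- nchar("C:\\Users")
print(n_chars)   # 8

# chartr() swaps every backslash for a forward slash
unix_style <- chartr("\\", "/", win_path)
print(unix_style)
```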

A.1.0.2 Example: Local directory on macOS or Linux

Use forward slashes:

setwd("/home/susan/RWork")

You can also use ~ as an abbreviation for your own home directory:

setwd("~/RWork")
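If you are curious what ~ stands for on your machine, base R's path.expand() shows the full path. This is a small sketch; the result depends on your user account, so no output is shown here.

```r
# Expand the ~ shortcut into a full path (the result differs per user and system)
expanded <- path.expand("~/RWork")
print(expanded)
```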

A.1.0.3 Example: Posit Cloud

Posit Cloud is a web environment that allows you to create R projects online. In a Posit project, your working directory will already be set to your project. You can confirm this like so:

getwd()
[1] "/cloud/project"

Installing & Working with R Libraries

Before working with spatial data, we need to finish setting up our environment. This workshop requires several packages, which can be installed from CRAN:

  • sf: simplifies spatial data manipulation
  • tmap: streamlines thematic map creation
  • dplyr: facilitates data manipulation
  • tidygeocoder: converts addresses to coordinates easily

You only need to install packages once in an R environment. Use these commands to install the packages needed for this tutorial.

install.packages("sf")
install.packages("tmap")
install.packages("tidygeocoder")
install.packages("dplyr")
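Since packages only need to be installed once, one option is to check which ones are missing before installing. The sketch below is one way to do this with base R. The interactive() guard is an assumption of this sketch, added so that running the script non-interactively never triggers a download.

```r
required_pkgs <- c("sf", "tmap", "tidygeocoder", "dplyr")

# requireNamespace() returns TRUE when a package is installed and loadable
is_installed <- vapply(required_pkgs, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]
print(missing_pkgs)

# Install only what is missing (guarded so batch runs do not download anything)
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```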

Installation Tip

Mac users: check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the sf package. Using Homebrew to install GDAL usually fixes any remaining issues.

Now load the libraries required for the next steps:

library(sf)
library(dplyr)
library(tmap)

A.2 Intro to Spatial Data

Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.

Load Spatial Data

We need to load the spatial data (a shapefile). Remember, this type of data is actually composed of multiple files, and all of them need to be present for the data to be read correctly. Let’s use chicagotracts.shp for practice, which contains the census tract boundaries of Chicago.
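Before reading a shapefile, it can help to confirm that its companion files sit next to it. The sketch below uses base R only and assumes the dataset was unzipped into a SDOHPlace-DataWrangling folder inside your working directory. In the shapefile format, the .shp, .shx, and .dbf files are mandatory, while the .prj file is optional but stores the coordinate reference system.

```r
# Assumed location of the unzipped data, relative to the working directory
shp_path <- "SDOHPlace-DataWrangling/chicagotracts.shp"

# Build the names of the companion files that share the same base name
base_name <- tools::file_path_sans_ext(shp_path)
sidecars  <- paste0(base_name, c(".shp", ".shx", ".dbf", ".prj"))

# TRUE means the file was found; FALSE means it is missing
present <- file.exists(sidecars)
names(present) <- basename(sidecars)
print(present)
```

If any of the first three entries is FALSE, check that the zip file was fully extracted and that your working directory is set correctly.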

First, we need to read the shapefile from the location where you saved it.

Chi_tracts = st_read("SDOHPlace-DataWrangling/chicagotracts.shp")

Your output will look something like:

## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS:  WGS 84

Always inspect data after loading it. Let’s start with a non-spatial view of the first few rows.

head(Chi_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.68822 ymin: 41.72902 xmax: -87.62394 ymax: 41.87455
## Geodetic CRS:  WGS 84
##   commarea commarea_n countyfp10     geoid10 name10        namelsad10 notes statefp10 tractce10
## 1       44         44        031 17031842400   8424 Census Tract 8424  <NA>        17    842400
## 2       59         59        031 17031840300   8403 Census Tract 8403  <NA>        17    840300
## 3       34         34        031 17031841100   8411 Census Tract 8411  <NA>        17    841100
## 4       31         31        031 17031841200   8412 Census Tract 8412  <NA>        17    841200
## 5       32         32        031 17031839000   8390 Census Tract 8390  <NA>        17    839000
## 6       28         28        031 17031838200   8382 Census Tract 8382  <NA>        17    838200
##                         geometry
## 1 POLYGON ((-87.62405 41.7302...
## 2 POLYGON ((-87.68608 41.8229...
## 3 POLYGON ((-87.62935 41.8528...
## 4 POLYGON ((-87.68813 41.8556...
## 5 POLYGON ((-87.63312 41.8744...
## 6 POLYGON ((-87.66782 41.8741...

Check out the data structure of this file.

str(Chi_tracts)
## Classes 'sf' and 'data.frame':   801 obs. of  10 variables:
##  $ commarea  : chr  "44" "59" "34" "31" ...
##  $ commarea_n: num  44 59 34 31 32 28 65 53 76 77 ...
##  $ countyfp10: chr  "031" "031" "031" "031" ...
##  $ geoid10   : chr  "17031842400" "17031840300" "17031841100" "17031841200" ...
##  $ name10    : chr  "8424" "8403" "8411" "8412" ...
##  $ namelsad10: chr  "Census Tract 8424" "Census Tract 8403" "Census Tract 8411" "Census Tract 8412" ...
##  $ notes     : chr  NA NA NA NA ...
##  $ statefp10 : chr  "17" "17" "17" "17" ...
##  $ tractce10 : chr  "842400" "840300" "841100" "841200" ...
##  $ geometry  :sfc_POLYGON of length 801; first list element: List of 1
##   ..$ : num [1:243, 1:2] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:9] "commarea" "commarea_n" "countyfp10" "geoid10" ...
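The str() output shows that an sf object is a data frame with one extra geometry column. The sketch below illustrates a few ways to look at those parts separately. To stay self-contained it builds a tiny stand-in object, toy_tracts, with made-up values; the helper unit_square() is also hypothetical. With the real data you would apply the same calls to Chi_tracts.

```r
library(sf)

# Hypothetical helper: a 1 x 1 square polygon whose lower-left corner is (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Toy stand-in for Chi_tracts (attribute values are made up)
toy_tracts <- st_sf(
  name10     = c("0001", "0002"),
  commarea_n = c(44, 59),
  geometry   = st_sfc(unit_square(0), unit_square(1), crs = 4326)
)

# Drop the geometry column to get a plain, non-spatial data frame
attrs <- st_drop_geometry(toy_tracts)
class(attrs)

# Inspect the coordinate reference system and the geometry type
st_crs(toy_tracts)$epsg
st_geometry_type(toy_tracts, by_geometry = FALSE)
```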

The data is no longer a shapefile but an sf object, composed of polygons. The plot() command in R helps to quickly visualize the geometric shapes of Chicago’s census tracts. The output includes multiple maps because the sf framework draws a preview of each attribute in our spatial file.

plot(Chi_tracts)
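If you want a single map instead of one panel per attribute, you can plot only the geometry or select one column first. The sketch below demonstrates both on a small stand-in object, toy_tracts, with made-up values and a hypothetical helper unit_square(). With the real data, the equivalent calls would be plot(st_geometry(Chi_tracts)) and plot(Chi_tracts["commarea_n"]).

```r
library(sf)

# Hypothetical helper: a 1 x 1 square polygon whose lower-left corner is (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Toy stand-in for Chi_tracts (attribute values are made up)
toy_tracts <- st_sf(
  name10     = c("0001", "0002"),
  commarea_n = c(44, 59),
  geometry   = st_sfc(unit_square(0), unit_square(1), crs = 4326)
)

# Outlines only: no attribute panels
plot(st_geometry(toy_tracts))

# One map, shaded by a single attribute
plot(toy_tracts["commarea_n"])
```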

A.2.1 Adding a Basemap

Next, we can use tmap, a mapping library, in interactive mode to add a basemap layer. In interactive ("view") mode, tmap draws our layers on top of a web basemap. The code below plots the tract boundaries from Chi_tracts and labels the map with a title, offering a straightforward visualization of the area’s census tracts.

We style the tract borders by making them 50% transparent (which is equal to an alpha level of 0.5).

library(tmap)

tmap_mode("view")
## ℹ tmap mode set to "view".
tm_shape(Chi_tracts) + tm_borders(alpha=0.5) +
  tm_layout(title = "Census Tract Map of Chicago")
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────────────────────────────────────────────────
## [v3->v4] `tm_borders()`: use `fill_alpha` instead of `alpha`.
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(title = )`
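The messages above mean that the code uses tmap version 3 syntax, which version 4 still accepts but flags. Below is a hedged sketch of a version 4 style equivalent, assuming tmap 4.0 or later is installed. It uses tm_title() for the title, as the message suggests, and col_alpha for the border transparency. That argument name is our reading of the version 4 interface for border lines (the message itself mentions fill_alpha), so check ?tm_borders in your installed version. The example uses a small stand-in object, toy_tracts, built with a hypothetical helper unit_square(); with the real data you would pass Chi_tracts to tm_shape().

```r
library(sf)
library(tmap)

# Hypothetical helper: a 1 x 1 square polygon whose lower-left corner is (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Toy stand-in for Chi_tracts
toy_tracts <- st_sf(
  name10   = c("0001", "0002"),
  geometry = st_sfc(unit_square(0), unit_square(1), crs = 4326)
)

tmap_mode("view")

# Version 4 style: col_alpha (assumed argument name) sets border transparency,
# and tm_title() replaces tm_layout(title = )
toy_map <- tm_shape(toy_tracts) +
  tm_borders(col_alpha = 0.5) +
  tm_title("Toy tract map")

# Type toy_map at the console (or call print(toy_map)) to render the map
```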