Jerid Francom bio photo

Jerid Francom

Associate Professor of Spanish and Linguistics
Romance Languages
Wake Forest University

Curriculum vitae

Email Twitter Github Stackoverflow Last.fm

I’m currently working on a project involving Twitter posts and demographics. One of the best resources for demographic information in the US is the census. Having not worked with US census data in a very long time, I was excited to see that there is an R package available to make the process easier.

In this exploRation I will provide a tutorial on 1) how to acquire the US census data and other demographic data through American Fact Finder, 2) how to visualize the data in regional choropleths, 3) how to overlay geo-tagged tweets, and finally 4) how to display the map as an interactive plot.

In this post I’ll tackle the first two points and leave Twitter data and interactive plots for a follow up post.

A handy package for working with US Census data in R is the UScensus2010 package. It is available on CRAN and can be installed in the normal way.

install.packages("UScensus2010")

It is a helper package to interface the spatial and demographic data that is available in a series of other packages dedicated to varying political and statistical regions: namely UScensus2010tract, UScensus2010cdp, UScensus2010county, UScensus2010blkgrp, and UScensus2010blk.

In this tutorial I’ll be working with the “tract” level [US Census Tract description]. A tract is a small subdivision of a county. In this particular case I want to explore the Tucson, Arizona metropolitain area. County would be too wide, and city boundaries too narrow.

UScensus2010 provides an installer fuction install.tract() which should install the tract data on your machine. I found problems, however, with the installer, and had to do some poking around on the web. Luckily, I found a repository where the data can be manually downloaded and installed.

url <- "http://lakshmi.calit2.uci.edu/census2000/R/src/contrib/UScensus2010tract_1.00.tar.gz"
install.packages(url, repos = NULL)

With the tract data downloaded and installed, it can be loaded using the city() or county() function. Again, I want county-level information here. Tucson belongs to “Pima” county. [I’ve specified level = "tract" to get access to the tract data that we installed, but if you have downloaded other UScensus package data you can specify the which you want to pull here.]

library(UScensus2010)
pima.tract <- UScensus2010::county(name = "pima", state = "arizona", level = "tract")
## 
## UScensus2010tract: US Census 2010 Tract Level Shapefiles and Additional Demographic
## Data 
## Version 1.00 created on 2011-11-06 
## copyright (c) 2011, Zack W. Almquist, University of California-Irvine
## Type help(package="UScensus2010tract") to get started.
## 
## For citation information, type citation("UScensus2010tract").

The resulting data in pima.tract is a SpatialPolygonsDataFrame grouping demographic data and polygon coordinates by tract id.

A map can be generated in various ways. First, base R’s plot() function will produce a quick and dirty view of the tracts for Pima county.

plot(pima.tract)

center

UScensus2010 also provides various plotting functions. Of these the choropleth.ssplot() function is an easy way to generate a choropleth. Using the defaults, the “Total population” (P0010001) variable is used to fill the tracts. For more information on demographic variables

choropleth.spplot(pima.tract)

center

If you are like me, I am more comfortable working with ggplot2. To work with this data, however, it needs to be converted to a data.frame using the fortify()UPDATED: 2016-03-08 tidy() function from the broom package. Below is a function that carries out the conversion from sp object to data.frame object.

Thanks to Andy Bush for the heads up on switching to tidy() before fortify() depreciates.

sp2df <- function(data) {
  data@data$id <- rownames(data@data)
  data.points = broom::tidy(data, region = "id")
  data.df = plyr::join(data.points, data@data, by = "id")
  return(data.df)
}

Then we create the data.frame version of the data.

pima.tract.df <- sp2df(pima.tract)

Let’s produce a simple choropleth with P0010001 (“Total population”) as the fill aesthetic as a above.

p <- ggplot2::ggplot(pima.tract.df, 
                aes(x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = P0010001), 
               colour = alpha("white", 1/2), size = 0.5) +
  theme(legend.position = "bottom") +
  labs(fill = "Total population")
p

center

The county plot is extremely dense around the area of interest –the Tucson metropolitain area. With ggplot2 we can subset this plot with coord_map(). But to use this we are going to need the coordinates for the bounding box. I found a great site which provides a easy-peasy interface for doing just such a thing.

bbox

p + ggplot2::coord_map(xlim=-c(111.2, 110.7), ylim=c(32.1, 32.5))

center

There is an extensive amount of demographic information that is available in the sp object that county() creates. But there is much more information available on the American FactFinder site. I am working with a project which aims to look at language use information so I selected the tract information for Pima county on from the American Community Survey (ACS) program’s 2014 5-year estimates and downloaded the “AGE BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER” (B16007) data.

The second row of the csv file contains helpful descriptions of the variables, but I dropped it before loading it into my R session.

pima.lang.df <- readr::read_csv("census2010/pima/ACS_14_5YR_B16007.csv", 
                col_names = TRUE)

Now what I want to do is join the pima.tract.df and the new pima.lang.df datasets. The common key between both sets is the $fips (Federal Information Processing Standard) codes. Yet in the pima.lang.df data this is listed as $GEO.id2. Furthermore, in the pima.tract.df the $fips variable is of type character, and needs to be numeric for direct merging.

pima.lang.df <- plyr::rename(pima.lang.df, c("GEO.id2"="fips"))
pima.tract.df$fips <- as.numeric(pima.tract.df$fips)

pima.data <- plyr::join(pima.tract.df, pima.lang.df)

# remove tracts with incomplete data
pima.data <- na.omit(pima.data) 

To look at the Spanish-speaking population, I pull the estimates for Spanish-speaking 5 to 17, 18 to 64, and 65 plus and add them together and then divide this vector by the total estimate of speakers in the tract.

estimate.span.5.17 <- pima.data$HD01_VD04 # 5 to 17
estimate.span.18.64 <- pima.data$HD01_VD10 # 18 to 64
estimate.span.65.plus <- pima.data$HD01_VD16 # 65 plus

span.total <- estimate.span.5.17 + estimate.span.18.64 + estimate.span.65.plus
speakers.total <- pima.data$HD01_VD01

We can use this information to fill the choropleth and convert the vector scores into percentages per tract.

p <- ggplot(pima.data, 
            aes(long, lat, group = group)) +
  geom_polygon(aes(fill = (span.total/speakers.total)*100), 
               colour = alpha("white", 1/2), size = 0.2) + 
  scale_fill_gradient(name="% Spanish speaking") +
  coord_map(xlim=-c(111.2, 110.7), 
            ylim=c(32.1, 32.5)) + 
  theme(legend.position = "bottom")
p

center

Getting a bit more fancy we can overlay this plot on a roadmap generated by Google.

library(ggmap)
pima <- get_map(location = geocode(location = "tucson"), 
                source = "google", 
                zoom = 10, 
                maptype = "roadmap", 
                color = "bw")

p.roadmap <- ggmap(pima) + 
  geom_polygon(data = pima.data, 
               aes(long, lat, group = group, 
                   fill = (span.total/speakers.total)*100, 
                   alpha = (span.total/speakers.total)*100), size = .25) + 
  coord_map(xlim=-c(111.2, 110.7), 
            ylim=c(32.1, 32.5)) + 
  theme(legend.position = "none")
p.roadmap

center

In the next post I will incorporate data from Twitter and step up the plots using plotly.

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] ggmap_2.6.1            UScensus2010tract_1.00 UScensus2010_0.11     
## [4] foreign_0.8-66         maptools_0.8-39        sp_1.2-2              
## [7] knitr_1.12.3           magrittr_1.5           ggplot2_2.0.0         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3         formatR_1.2.1       plyr_1.8.3         
##  [4] tools_3.2.3         digest_0.6.9        evaluate_0.8       
##  [7] nlme_3.1-124        gtable_0.1.2        lattice_0.20-33    
## [10] png_0.1-7           psych_1.5.8         DBI_0.3.1          
## [13] mapproj_1.2-4       rgdal_1.1-3         parallel_3.2.3     
## [16] proto_0.3-10        dplyr_0.4.3         stringr_1.0.0      
## [19] RgoogleMaps_1.2.0.7 rgeos_0.3-15        maps_3.0.2         
## [22] grid_3.2.3          R6_2.1.2            jpeg_0.1-8         
## [25] RJSONIO_1.3-0       reshape2_1.4.1      tidyr_0.4.1        
## [28] readr_0.2.2         scales_0.3.0        assertthat_0.1     
## [31] mnormt_1.5-3        geosphere_1.5-1     colorspace_1.2-6   
## [34] labeling_0.3        stringi_1.0-1       munsell_0.4.2      
## [37] broom_0.4.0         rjson_0.2.15