I’m currently working on a project involving Twitter posts and demographics. One of the best resources for demographic information in the US is the census. Having not worked with US census data in a very long time, I was excited to see that there is an R package available to make the process easier.
In this exploRation I will provide a tutorial on 1) how to acquire US census data and other demographic data through American FactFinder, 2) how to visualize the data in regional choropleths, 3) how to overlay geo-tagged tweets, and finally 4) how to display the map as an interactive plot.
In this post I’ll tackle the first two points and leave the Twitter data and interactive plots for a follow-up post.
A handy package for working with US Census data in R is the UScensus2010 package. It is available on CRAN and can be installed in the normal way.
It is a helper package that provides an interface to the spatial and demographic data available in a series of companion packages, each dedicated to a different political or statistical region.
In this tutorial I’ll be working at the “tract” level [US Census Tract description]. A tract is a small subdivision of a county. In this particular case I want to explore the Tucson, Arizona metropolitan area. The county would be too wide, and the city boundaries too narrow.
UScensus2010 provides an installer function, install.tract(), which should install the tract data on your machine. I ran into problems with the installer, however, and had to do some poking around on the web. Luckily, I found a repository where the data can be manually downloaded and installed.
With the tract data downloaded and installed, it can be loaded using the county() function. Here I want the data for a single county: Tucson belongs to “Pima” county. [I’ve specified level = "tract" to get access to the tract data that we installed, but if you have downloaded other UScensus package data you can specify which level you want to pull here.]
The resulting data in pima.tract is a SpatialPolygonsDataFrame grouping demographic data and polygon coordinates by tract id.
A map can be generated in various ways. First, base R’s plot() function will produce a quick and dirty view of the tracts for Pima county.
UScensus2010 also provides various plotting functions. Of these, the choropleth.ssplot() function is an easy way to generate a choropleth. Using the defaults, the “Total population” (P0010001) variable is used to fill the tracts. For more information on demographic variables
If you are like me, you are more comfortable working with ggplot2. To plot this data with ggplot2, however, the sp object needs to be converted to a data.frame using the tidy() function from the broom package. Below is a function that carries out the conversion from sp object to
Thanks to Andy Bush for the heads up on switching to
Then we create the data.frame version of the data.
Let’s produce a simple choropleth with P0010001 (“Total population”) as the fill aesthetic, as above.
The county plot is extremely dense around the area of interest: the Tucson metropolitan area. With ggplot2 we can zoom in on this part of the plot with coord_map(). But to use this we are going to need the coordinates for the bounding box. I found a great site which provides an easy-peasy interface for doing just such a thing.
There is an extensive amount of demographic information available in the sp object that county() creates. But there is much more information available on the American FactFinder site. I am working on a project which aims to look at language use, so I selected the tract information for Pima county from the American Community Survey (ACS) program’s 2014 5-year estimates and downloaded the “AGE BY LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER” (B16007) data.
The second row of the csv file contains helpful descriptions of the variables, but I dropped it before loading it into my R session.
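The description row can also be skipped at load time instead of being removed by hand. The following is a minimal sketch on a made-up extract: the inline text, its values, and the HD01_* column codes are assumptions that only mimic the shape of an American FactFinder download.

```r
# Toy stand-in for the downloaded file: row 1 = variable codes,
# row 2 = human-readable descriptions, rows 3+ = data (values invented).
csv_text <- c(
  "GEO.id2,HD01_VD01,HD01_VD04",
  "Id2,Estimate; Total:,Estimate; 5 to 17 years: - Speak Spanish",
  "4019000100,1000,50",
  "4019000200,2000,300"
)

# Read the variable codes from the first line only.
col_names <- unlist(strsplit(csv_text[1], ","))

# Skip both header rows and attach the codes as column names.
pima.lang.df <- read.csv(
  text = paste(csv_text, collapse = "\n"),
  skip = 2, header = FALSE, col.names = col_names,
  stringsAsFactors = FALSE
)

str(pima.lang.df)
```

Skipping the row before parsing keeps the estimate columns numeric; if the descriptions were read in as data, every column would come back as character.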
Now what I want to do is join the pima.tract.df and the new pima.lang.df datasets. The common key between the two sets is the $fips (Federal Information Processing Standard) code. Yet in the pima.lang.df data this is listed as $GEO.id2. Furthermore, in pima.tract.df the $fips variable is of type character, and needs to be numeric for direct merging.
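The key alignment described here can be sketched in base R on two toy tables. All values are invented, the zero-padded character codes are an assumption about how $fips is stored, and pima.df is a hypothetical name for the merged result.

```r
# Toy versions of the two tables (all values invented for illustration).
pima.tract.df <- data.frame(
  fips = c("04019000100", "04019000100", "04019000200"),  # character key
  long = c(-110.97, -110.96, -110.95),
  lat  = c(32.22, 32.23, 32.24),
  stringsAsFactors = FALSE
)
pima.lang.df <- data.frame(
  GEO.id2   = c(4019000100, 4019000200),  # numeric key, leading zero lost
  HD01_VD01 = c(1000, 2000)
)

# Give the key the same name and the same type in both tables.
names(pima.lang.df)[names(pima.lang.df) == "GEO.id2"] <- "fips"
pima.tract.df$fips <- as.numeric(pima.tract.df$fips)

# Left join: keep every polygon vertex and attach the tract's ACS columns.
pima.df <- merge(pima.tract.df, pima.lang.df, by = "fips", all.x = TRUE)

pima.df
```

Note that merge() does not preserve the original row order. If the tidied data carries an order column, sorting by it again before plotting should keep the polygon outlines intact.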
To look at the Spanish-speaking population, I pull the estimates for Spanish speakers in the 5 to 17, 18 to 64, and 65 plus age groups, add them together, and then divide this vector by the total estimate of speakers in the tract.
We can use this information to fill the choropleth and convert the vector scores into percentages per tract.
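As a rough illustration of that arithmetic, the sketch below works on a two-tract toy table. The column names are hypothetical stand-ins for the B16007 estimate columns, and the counts are invented.

```r
# Toy ACS extract; column names are hypothetical and counts are invented.
pima.lang.df <- data.frame(
  fips          = c(4019000100, 4019000200),
  total         = c(1000, 2000),  # population 5 years and over
  spanish_5_17  = c(50, 300),
  spanish_18_64 = c(150, 900),
  spanish_65_up = c(25, 100)
)

# Sum the three age bands, then divide by the tract total.
spanish_cols <- c("spanish_5_17", "spanish_18_64", "spanish_65_up")
pima.lang.df$spanish_prop <-
  rowSums(pima.lang.df[, spanish_cols]) / pima.lang.df$total

# Express the proportion as a percentage for the plot legend.
pima.lang.df$spanish_pct <- round(100 * pima.lang.df$spanish_prop, 1)

pima.lang.df[, c("fips", "spanish_prop", "spanish_pct")]
```

A tract with a total of zero would yield NaN here, so it may be worth checking for empty tracts before using the column as a fill.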
Getting a bit fancier, we can overlay this plot on a roadmap generated by Google.
In the next post I will incorporate data from Twitter and step up the plots using