
In this ExploRation, I will demonstrate how to scrape text data from the web with R. This particular example aims to collect a series of State of the Union (SOTU) speeches [1947-present] from http://www.presidency.ucsb.edu/ and write the plain-text contents to disc. The bulk of the work will be done with the recently released rvest package. The scripting will also employ the magrittr package for writing legible code.
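
Both packages are available on CRAN. If you do not already have them installed, a one-time install (a minimal sketch, assuming a standard CRAN setup) looks like this:

# Install the packages used below (one time only)
install.packages(c("rvest", "magrittr"))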

To get started, we first identify the sub-page ../sou.php that contains the links of interest.


This page contains links to the pages where each of the SOTU addresses is posted. To load that page into R as a parsed html object, we use rvest’s html() function.

library("rvest")
# Load the page
main.page <- html(x = "http://www.presidency.ucsb.edu/sou.php")

Once we have the page, the next step is to identify how to isolate the links that we are interested in from the other links on the page. The documentation for the package refers to Selectorgadget, a bookmarklet for your browser that allows you to point-and-click your way to identifying either the CSS selector or XPath needed to get the target html objects.

Activating Selectorgadget, you click on the html object you want and see what becomes highlighted. In most cases this will highlight more objects than you want, so you then click again on the object(s) you do not want to isolate. In our case, clicking first on the “2013” link in the SOTU listing and then on the “Florida 2000” link leaves us with the right objects selected.


Now we can return to R and use the CSS selector ‘.ver12 a’ to get our links. The html_nodes() function gets us the elements we want, but they come html-warts and all. For the URLs we use the html_attr() function and specify that we want the part contained under href (ex. <a href="http://www.presidency.ucsb.edu/ws/index.php?pid=29431">1790</a>). The same basic process is applied to get the link text, but instead we use the html_text() function to get the ‘1790’ part of the previous URL example. Then we combine the results into a data.frame sotu.

# Get link URLs
urls <- main.page %>% # feed `main.page` to the next step
  html_nodes(".ver12 a") %>% # get the CSS nodes
  html_attr("href") # extract the URLs
# Get link text
links <- main.page %>% # feed `main.page` to the next step
  html_nodes(".ver12 a") %>% # get the CSS nodes
  html_text() # extract the link text
# Combine `links` and `urls` into a data.frame
sotu <- data.frame(links = links, urls = urls, stringsAsFactors = FALSE)
head(sotu)
##   links                                                   urls
## 1  2013 http://www.presidency.ucsb.edu/ws/index.php?pid=102826
## 2  2014 http://www.presidency.ucsb.edu/ws/index.php?pid=104596
## 3  2015 http://www.presidency.ucsb.edu/ws/index.php?pid=108031
## 4  2009  http://www.presidency.ucsb.edu/ws/index.php?pid=85753
## 5  2010  http://www.presidency.ucsb.edu/ws/index.php?pid=87433
## 6  2011  http://www.presidency.ucsb.edu/ws/index.php?pid=88928

The results look great. We still need to keep only those addresses we are interested in: those dated between 1947 and 2015. To do this we simply use the %in% operator to filter the sotu$links column by the vector 1947:2015.

sotu <- subset(x = sotu, links %in% 1947:2015) # Truman to Obama
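
As a quick sanity check (just a sketch of one way to verify the subset), we can confirm that only addresses from the target years remain:

# Confirm that every remaining link falls in the 1947-2015 range
stopifnot(all(sotu$links %in% 1947:2015))
range(as.numeric(sotu$links)) # earliest and latest years kept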

The next step is to follow each of these links, extract the text, and write the text to disc. To keep our files organized, we will dynamically generate the file names, marking each as either republican or democrat based on the years that Republicans held the presidency, and then append the year. This will result in files with the format republican-2001.txt.

First the filter: the years in which Republicans were in office.

# Vector to mark SOTU address political party
republicans <- c(1953:1960, 1970:1974, 1974:1977, 1981:1988, 1989:1992, 2001:2008)

Now the aim is to loop through each of the links in our sotu data.frame (i.e. over nrow(sotu) rows), grab the parsed html with html(), isolate the relevant node with the ".displaytext" selector, and extract the text with html_text(). After the text has been scraped, we decide whether it should be marked Republican or Democrat using the previous filter and an ifelse() statement, build the file name, and write the file to disc.

# Create the output directory (if needed), then loop over each row in `sotu`
dir.create("texts", showWarnings = FALSE)
for(i in seq(nrow(sotu))) {
  text <- html(sotu$urls[i]) %>% # load the page
    html_nodes(".displaytext") %>% # isloate the text
    html_text() # get the text
  # Find the political party of this link
  party <- ifelse(test = sotu$links[i] %in% republicans,
                  yes = "republican", no = "democrat")
  # Create the file name
  filename <- paste0("texts/", party, "-", sotu$links[i], ".txt")
  sink(file = filename) # open the file to write to
  cat(text) # write the text to the file
  sink() # close the file
}

And that should do it. Looking at our directory, we see that the files are now there and in order.

# View the `texts/` directory
dir(path = "texts", full.names = TRUE)
##  [1] "texts/democrat-1947.txt"   "texts/democrat-1948.txt"  
##  [3] "texts/democrat-1949.txt"   "texts/democrat-1950.txt"  
##  [5] "texts/democrat-1951.txt"   "texts/democrat-1952.txt"  
##  [7] "texts/democrat-1961.txt"   "texts/democrat-1962.txt"  
##  [9] "texts/democrat-1963.txt"   "texts/democrat-1964.txt"  
## [11] "texts/democrat-1965.txt"   "texts/democrat-1966.txt"  
## [13] "texts/democrat-1967.txt"   "texts/democrat-1968.txt"  
## [15] "texts/democrat-1969.txt"   "texts/democrat-1978.txt"  
## [17] "texts/democrat-1979.txt"   "texts/democrat-1980.txt"  
## [19] "texts/democrat-1993.txt"   "texts/democrat-1994.txt"  
## [21] "texts/democrat-1995.txt"   "texts/democrat-1996.txt"  
## [23] "texts/democrat-1997.txt"   "texts/democrat-1998.txt"  
## [25] "texts/democrat-1999.txt"   "texts/democrat-2000.txt"  
## [27] "texts/democrat-2009.txt"   "texts/democrat-2010.txt"  
## [29] "texts/democrat-2011.txt"   "texts/democrat-2012.txt"  
## [31] "texts/democrat-2013.txt"   "texts/democrat-2014.txt"  
## [33] "texts/democrat-2015.txt"   "texts/republican-1953.txt"
## [35] "texts/republican-1954.txt" "texts/republican-1955.txt"
## [37] "texts/republican-1956.txt" "texts/republican-1957.txt"
## [39] "texts/republican-1958.txt" "texts/republican-1959.txt"
## [41] "texts/republican-1960.txt" "texts/republican-1970.txt"
## [43] "texts/republican-1971.txt" "texts/republican-1972.txt"
## [45] "texts/republican-1974.txt" "texts/republican-1975.txt"
## [47] "texts/republican-1976.txt" "texts/republican-1977.txt"
## [49] "texts/republican-1981.txt" "texts/republican-1982.txt"
## [51] "texts/republican-1983.txt" "texts/republican-1984.txt"
## [53] "texts/republican-1985.txt" "texts/republican-1986.txt"
## [55] "texts/republican-1987.txt" "texts/republican-1988.txt"
## [57] "texts/republican-1989.txt" "texts/republican-1990.txt"
## [59] "texts/republican-1991.txt" "texts/republican-1992.txt"
## [61] "texts/republican-2001.txt" "texts/republican-2002.txt"
## [63] "texts/republican-2003.txt" "texts/republican-2004.txt"
## [65] "texts/republican-2005.txt" "texts/republican-2006.txt"
## [67] "texts/republican-2007.txt" "texts/republican-2008.txt"
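
To spot-check the contents, we can read one of the new files back into R. This is just a sketch, assuming the loop above completed and the texts/ directory was populated as shown:

# Read one scraped address back in and peek at its opening words
truman.1947 <- readLines("texts/democrat-1947.txt")
substr(paste(truman.1947, collapse = " "), 1, 100)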

A note is in order on isolating the text on each SOTU page. Selectorgadget is really handy, but in my experience it isn’t foolproof. If you cannot get the highlighting to work, you will need to open up the html page source and do some sleuthing. In Safari on OS X, you will need to enable “Show Develop menu in menu bar” and then choose “Show Web Inspector”. Perusing the html structure, you will need some trial and error to find the CSS selector(s) that work. After some poking around, .displaytext turns out to do the trick.
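
For example, once you have a candidate selector, it is worth testing it against a single address page before committing to the full loop. Here is a sketch using one of the URLs collected in sotu above:

# Test the candidate CSS selector on a single SOTU page (the 2013 address)
test.page <- html("http://www.presidency.ucsb.edu/ws/index.php?pid=102826")
test.page %>%
  html_nodes(".displaytext") %>% # isolate the address text
  html_text() %>%
  substr(1, 200) # peek at the first 200 characters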


sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] graphics  grDevices utils     datasets  methods   stats     base     
## 
## other attached packages:
## [1] rvest_0.2.0   EBImage_4.8.1 kfigr_1.0.2   knitr_1.9     Rdym_0.2.0   
## [6] ggplot2_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] abind_1.4-0         BiocGenerics_0.12.1 bitops_1.0-6       
##  [4] colorspace_1.2-4    digest_0.6.8        evaluate_0.5.5     
##  [7] formatR_1.0         grid_3.1.2          gtable_0.1.2       
## [10] httr_0.6.1          jpeg_0.1-8          lattice_0.20-29    
## [13] locfit_1.5-9.1      magrittr_1.5        MASS_7.3-37        
## [16] munsell_0.4.2       parallel_3.1.2      plyr_1.8.1         
## [19] png_0.1-7           proto_0.3-10        Rcpp_0.11.4        
## [22] RCurl_1.95-4.5      reshape2_1.4.1      scales_0.2.4       
## [25] selectr_0.2-3       stringr_0.6.2       tiff_0.1-5         
## [28] tools_3.1.2         XML_3.98-1.1