In this ExploRation, I will demonstrate how to scrape text data from the web with R. This particular example aims to collect a series of State of the Union (SOTU) speeches [1947-present] from http://www.presidency.ucsb.edu/ and write the plain-text contents to disc. The bulk of the work will be done with the recently released rvest package. The scripting will also employ the magrittr package for writing legible code.
To get started, we first identify the sub-page ../sou.php, which contains the links of interest.
This page contains links to the pages on which all of the SOTU addresses are posted. To load that page into R as a parsed HTML object, we use rvest’s html() function.
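As a sketch, loading the index page might look like the following; the full URL is my assumption, built from the base site and sub-page mentioned above (note that in current versions of rvest, html() has been superseded by read_html()):

```r
# rvest does the scraping; magrittr provides the %>% pipe used throughout
library(rvest)
library(magrittr)

# Parse the SOTU index page into an HTML document object
# (newer rvest versions use read_html() instead of html())
sotu_page <- html("http://www.presidency.ucsb.edu/sou.php")
```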
Once we have the page, the next step is to work out how to isolate the links we are interested in from the other links on the page. The package documentation refers to SelectorGadget, a browser bookmarklet that lets you point-and-click your way to the CSS selector or XPath needed to target the HTML elements you want.
After activating SelectorGadget, you click on an HTML object you want and see what becomes highlighted. In most cases this will highlight more objects than you want, so you then click on the object(s) you want to exclude. In our case, clicking first on the “2013” link in the SOTU listing and then on the “Florida 2000” link leaves us with the right objects selected.
Now we can return to R and use the CSS selector ‘.ver12 a’ to get our links. The html_nodes() function gets us the elements we want, but they come HTML warts and all. For the URLs, we use the html_attr() function and specify that we want the part contained in href (e.g. <a href="http://www.presidency.ucsb.edu/ws/index.php?pid=29431">1790</a>). The same basic process applies to the link text, except we use the html_text() function to get the ‘1790’ part of the previous example. We then combine the results into a data.frame called sotu.
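A minimal sketch of this step, assuming the page has been parsed into an object called sotu_page; the column names links and dates are my own choices, not necessarily the post’s:

```r
# Grab every link node matched by the SelectorGadget-derived CSS selector
link_nodes <- sotu_page %>% html_nodes(".ver12 a")

sotu <- data.frame(
  links = link_nodes %>% html_attr("href"),  # the full URL of each address
  dates = link_nodes %>% html_text(),        # the link text, e.g. "1790"
  stringsAsFactors = FALSE
)
```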
The results look great. We still need to keep only the addresses we are interested in, those dated 1947–2015. To do this we simply use the %in% operator to filter sotu, keeping the rows whose year appears in the vector 1947:2015.
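The filter might look like this; I am assuming the year text lives in a column named dates, so match the name to however the data.frame was built (since %in% is implemented with match(), the character years compare cleanly against the numeric vector):

```r
# Keep only the addresses from 1947 through 2015
sotu <- sotu[sotu$dates %in% 1947:2015, ]
```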
The next step is to follow each of these links, extract the text, and write it to disc. To keep our files organized, we will generate the file names dynamically, marking each address as either republican or democrat based on the years in which Republicans held the presidency, and then appending the year. This results in files with the format republican-2001.txt.
First, the filter: the years in which Republicans were in office.
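One way to encode this is a vector of years spanning the Republican administrations in our window (Eisenhower, Nixon/Ford, Reagan/G.H.W. Bush, G.W. Bush). This is a sketch: transition years, where the outgoing and incoming presidents belong to different parties, may need hand-adjusting depending on who actually delivered that year’s address.

```r
# Years spanning Republican administrations, 1947-2015
# (trim transition years as needed for your purposes)
republican <- c(1953:1961, 1969:1977, 1981:1993, 2001:2009)
```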
Now the aim is to loop over each link in our sotu data.frame (i.e., over nrow(sotu) rows), grab the parsed HTML (html()), isolate the relevant node (".displaytext"), and extract its text (html_text()). Once the text has been scraped, we decide whether it should be marked Republican or Democrat using the previous filter and an ifelse() statement, compile the file name, and write the file to disc.
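Putting the pieces together, the loop might be sketched as follows. It assumes a data.frame sotu with links (URLs) and dates (years) columns, and a vector republican of the years a Republican held office; those names are assumptions for illustration:

```r
for (i in seq_len(nrow(sotu))) {
  # Parse this address's page and pull out the speech text
  speech <- html(sotu$links[i]) %>%
    html_nodes(".displaytext") %>%
    html_text()

  # Label by party, then year: e.g. "republican-2001.txt"
  party <- ifelse(sotu$dates[i] %in% republican, "republican", "democrat")
  file_name <- paste0(party, "-", sotu$dates[i], ".txt")

  writeLines(speech, file_name)
}
```

Being polite to the server with a short Sys.sleep() between iterations is a sensible addition for a loop like this.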
And that should do it. Looking at our directory we see that the files are now there and in order.
A note is in order on isolating the text on each SOTU page. SelectorGadget is really handy, but in my experience it isn’t foolproof. If you cannot get the highlighting to work, you will need to open up the page source and do some sleuthing. In Safari on OS X, you first need to enable “Show Develop menu in menu bar”, after which you can choose “Show Web Inspector”. Perusing the HTML structure takes some trial and error to find a CSS selector that works. After some poking around, .displaytext turns out to do the trick.