Recent Posts


In this last post dedicated to acquiring data for language research with R, I discuss strategies for scraping language data from the public-facing web. The rvest and tidyverse packages will do the heavy lifting, but to put this software into practice we need to get up to speed with the language of the web: HTML.


This is the second of three posts dedicated to acquiring data for language research with R. I will cover connecting to web service APIs with R packages. I will also discuss R vectors and data frames in more detail.


In this post, I will provide an overview of the first of three common strategies for acquiring corpus data in R: accessing corpus data from data repositories and individual sites. I will cover acquiring data from different sources and introduce you to the R code that will help speed the process, maintain consistency in our data, and set the stage for a reproducible workflow.


In this Recipe you will learn about the types of data available for language research and where to find them. The goal is to introduce you to the landscape of available language data and give a general overview of its characteristics across a variety of sources, providing you with resources to begin your own quantitative investigations.


In this post I will cover some of these topics, including the importance of identifying a research question, how different statistical approaches relate to different types of research, and how to understand data from a sampling and organizational standpoint. I will also provide some examples of linking research questions with variables in a toy dataset as we begin to discuss how to approach data analysis, primarily through visualization techniques.




The ACTIV-ES project is the ongoing development of a comparable Spanish corpus composed of TV/film dialogue from Argentine, Mexican, and Spanish productions.

Data Science for Language Research

Upcoming textbook aimed at introducing the fundamental concepts and practical code for applying Data Science in language research.


At Wake Forest University I teach upper-division undergraduate courses for the Spanish major/minor, core and elective courses for the Linguistics program, and graduate courses in the Translation and Interpreting Studies MA program.

I have taught the following courses:


  • 111. Elementary Spanish I
  • 113. Intensive Elementary Spanish
  • 153. Intermediate Spanish
  • 154. Accelerated Intermediate Spanish
  • 309. Spanish Grammar and Composition
  • 322. Spanish Pronunciation and Dialect Variation
  • 329. Introduction to Hispanic Linguistics


  • 150. Introduction to Linguistics
  • 330. Psycholinguistics and Language Acquisition
  • 380-680. Language Use and Technology
  • 383-683. Language Engineering: Localization and Terminology