In this post I will provide an overview of the process of taking raw text and metadata and organizing them into a tidy dataset; that is, a tabular data format where each row is an observation and each column is a corresponding attribute of the data.
In this last post dedicated to acquiring data for language research with R, I discuss strategies for scraping language from the public-facing web. The `rvest` and `tidyverse` packages will do the heavy lifting, but to put this software into practice we need to get up to speed with the language of the web: HTML.
This is the second of three posts dedicated to acquiring data for language research with R. I will cover connecting to web service APIs with R packages. I will also discuss R vectors and data frames in more detail.
In this post, I will provide an overview of the first of three common strategies for acquiring corpus data in R: accessing corpus data from data repositories and individual sites. I will cover acquiring data from different sources and introduce you to the R code that will help speed the process, maintain consistency in our data, and set the stage for a reproducible workflow.
In this Recipe you will learn about the types of data available for language research and where to find them. The goal is to introduce you to the landscape of available language data and to give a general overview of its characteristics across a variety of sources, providing you with resources to begin your own quantitative investigations.
In this post I will cover some of these topics, including the importance of identifying a research question, how different statistical approaches relate to different types of research, and how to understand data from a sampling and organizational standpoint. I will also provide some examples of linking research questions with variables in a toy dataset as we begin to discuss how to approach data analysis, primarily through visualization techniques.
In this third post in the Recipe series, I provide an overview of, and steps for, organizing a scalable data science project. This will include details on how to set up an R project and organize scripts, data, and reports, and will touch on various best practices that lead to an efficient workflow and set the stage for a portable and easily reproducible research project.