Jerid Francom

Associate Professor of Spanish and Linguistics
Romance Languages
Wake Forest University

The first release of the ACTIV-ES Spanish dialect corpus based on TV/film transcripts is now available on GitHub.

It includes 3,460,172 tokens in total (Argentina: 1,103,039; Mexico: 976,192; Spain: 1,380,941) and comes in two formats: running text and word lists (1- to 5-grams). Each format is available in both a plain-text and a part-of-speech-tagged version.
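To make the figures above concrete, here is a minimal Python sketch that checks the per-country token counts against the reported total and shows what a 1- to 5-gram word list contains. The token counts come from this post; the helper names (`shares`, `ngrams`), the whitespace tokenization, and the sample sentence are illustrative assumptions, not the procedure or file layout actually used to build the corpus.

```python
from collections import Counter

# Token counts per country, as reported in this post.
TOKENS = {"Argentina": 1_103_039, "Mexico": 976_192, "Spain": 1_380_941}


def shares(counts):
    """Return each country's share of the total as a percentage."""
    total = sum(counts.values())
    return {country: round(100 * n / total, 1) for country, n in counts.items()}


def ngrams(tokens, n_min=1, n_max=5):
    """Count every n-gram of length n_min..n_max in a list of tokens."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # range() is empty when the text is shorter than n tokens.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


total = sum(TOKENS.values())
country_shares = shares(TOKENS)

# Hypothetical sample line; naive whitespace tokenization is an assumption.
sample = "no sé qué decir".split()
sample_counts = ngrams(sample)

print(total)
print(country_shares)
print(sample_counts.most_common(3))
```

A four-token line yields ten n-grams here (four unigrams, three bigrams, two trigrams, and one 4-gram), since no 5-gram fits in it.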

For more information about the development and evaluation of this resource, you can download our paper, “ACTIV-ES: a comparable cross-dialect corpus of everyday Spanish from Argentina, Mexico, and Spain”, presented at the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

Below is a visualization of the current language distribution (version 0.1) by time, country, size, and genre.

[Figure: plot_country_year_genre]