The first release of the ACTIV-ES Spanish dialect corpus based on TV/film transcripts is now available on GitHub.
It includes 3,460,172 total tokens (Argentina: 1,103,039; Mexico: 976,192; Spain: 1,380,941) and comes in two formats: running text and word lists (1- to 5-grams). Each format is available in both a plain-text and a part-of-speech-tagged version.
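To give a rough idea of how the word list format relates to the running text, here is a minimal sketch that counts every 1- to 5-gram in a tokenized line. The function name `ngram_counts`, the whitespace tokenization, and the sample sentence are assumptions for illustration only; the released word lists may have been built with different tokenization and formatting.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample line, split on whitespace (an assumption, not the
# corpus's actual tokenization scheme).
sample = "no sé qué decir no sé".split()
counts = ngram_counts(sample)

# Print the n-grams that occur more than once in the sample.
for ngram, freq in counts.most_common():
    if freq > 1:
        print(ngram, freq)
```

The same function could be run over each line of a running text file and the counters summed to approximate a per-country word list.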
For more information about the development and evaluation of this resource, you can download our paper, “ACTIV-ES: a comparable cross-dialect corpus of everyday Spanish from Argentina, Mexico, and Spain”, presented at the Ninth International Conference on Language Resources and Evaluation (LREC 2014).
Below is a visualization of the current language distribution (version 0.1) by time, country, size, and genre.