Language Resources

One of my research interests is developing new language resources for research in linguistics and NLP. Some of them are listed here.

Corpora

DSL Corpus Collection (DSLCC)
Description: Collection of journalistic corpora written in closely related languages and language varieties. The dataset was used in the DSL Shared Tasks in 2014, 2015, 2016, and 2017.
Info: [pdf]
Link: DSLCC

Colonia: Corpus of Historical Portuguese
Description: Portuguese historical corpus comprising texts from the 16th to the early 20th century, lemmatized and annotated with POS information. The corpus is available for download and through a graphical CQPWeb-based interface.
Since May 2014, thanks to Diana Santos (University of Oslo), Colonia has also been available at Linguateca.
Since October 2014, thanks to Eckhard Bick (University of Southern Denmark), a version of Colonia tagged using the PALAVRAS parsing system has been available through CorpusEye.
Since August 2017, thanks to Rachael Tatman, Colonia has been available on Kaggle.
Info: [pdf] [pdf]
Links: 1) Colonia Website, 2) Colonia at Linguateca, 3) Colonia at CorpusEye, 4) Colonia at Kaggle

Word Lists, Frequency Lists, etc.

Word unigram frequency list from comparable Spanish corpora
Description: This frequency list was produced to compare linguistic features of four Spanish varieties (Spain, Argentina, Peru and Mexico) as described in my 2013 paper.
Info: [pdf]
Link: Word Unigram List

POS bigram frequency list from comparable Spanish corpora.
Description: This frequency list was produced from annotated data to compare linguistic features of four Spanish varieties (Spain, Argentina, Peru and Mexico) as described in my 2013 paper.
Info: [pdf]
Link: POS and Morphology

P-AWL: Portuguese Academic Word List.
Description: The P-AWL was developed for Portuguese on the basis of the English Academic Word List (AWL) by Coxhead (2000). It contains 1812 entries.
Info: [pdf]
Link: P-AWL

Last Updated: November 2017