One of my research interests is to create new language resources that can be used for research in linguistics and NLP. Here you can find some of them.
If you use any of these resources in your research, please refer to the respective description paper, available as a PDF.
Colonia is a Portuguese historical corpus containing texts from the 16th to the early 20th century, lemmatized and annotated with POS tags. The corpus is available for download and through a graphical CQPweb-based interface. Since May 2014, thanks to Diana Santos (University of Oslo), Colonia has also been available at Linguateca. Since October 2014, thanks to Eckhard Bick (University of Southern Denmark), a version of Colonia tagged with the PALAVRAS parsing system has been available through CorpusEye. Since August 2017, thanks to Rachael Tatman, Colonia has been available on Kaggle.
DSL Corpus Collection (DSLCC) DSLCC pdf
A collection of journalistic corpora written in closely related languages and language varieties. The dataset was used in the DSL Shared Tasks in 2014, 2015, 2016, and 2017.
NLI-PT: A Portuguese Native Language Identification Dataset NLI-PT pdf
A collection of 1,868 student essays written by learners of European Portuguese who are native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish.
LIDIOMS: A Multilingual Linked Idioms Data Set in Five Different Languages LIDIOMS pdf
This is a multilingual linked idioms data set in five different languages (English, Portuguese, Italian, German, Russian). It is currently being expanded to other languages.
Frequency lists from comparable Spanish corpora Word Unigrams POS and Morphology pdf
These two frequency lists were produced to compare linguistic features of four Spanish varieties (Argentina, Mexico, Peru, and Spain) as described in my 2013 paper.