(named) entity recognition

GenoVarDis

The corpus consists of (i) translated and manually curated documents carrying tmVar3 annotations (Wei et al., 2022), including PubMed abstracts, to which associated diseases and symptoms were added; and (ii) manually annotated PubMed abstracts in Spanish.

MultiCoNER-ES

MultiCoNER is a large multilingual dataset for named entity recognition (NER) that covers three domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixed subsets. It is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.

DIANN-2018-ES

The corpus is a collection of 500 abstracts from Elsevier journal papers in the biomedical domain, collected between 2017 and 2018. It is divided into two disjoint parts: a training set (80%) and a test set (20%). It is annotated with disabilities, negations, and the scope of each negation.
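
For illustration, the 80/20 proportions on 500 abstracts work out to 400 training and 100 test documents. The sketch below shows one way such a disjoint split could be produced; it assumes a seeded random shuffle, which the description above does not confirm, and the document identifiers and the `split_corpus` helper are hypothetical.

```python
import random


def split_corpus(doc_ids, train_fraction=0.8, seed=0):
    """Shuffle document IDs and cut them into two disjoint lists.

    Assumption: a seeded random split. The procedure actually used
    to split the corpus is not stated in the description.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical identifiers standing in for the 500 abstracts.
doc_ids = [f"doc_{i:03d}" for i in range(500)]
train_ids, test_ids = split_corpus(doc_ids)

print(len(train_ids), len(test_ids))  # 400 100
```

Slicing a single shuffled list at one cut point guarantees that no document appears in both parts, which is the property "disjoint" refers to.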