Written by Ivan Bercovich
on September 30, 2024

MessIRve

Together with Francisco Valentini, Viviana Cotik, Damián Ariel Furman, Edgar Altszyler, PhD, and Juan Manuel Pérez, we’re happy to announce the release of MessIRve, a new large-scale IR dataset in Spanish!

MessIRve contains around 730k queries from 20 Spanish-speaking countries and the United States, with relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets.

The dataset is available on Hugging Face:

Queries and relevance judgments: spanish-ir/messirve
The collection of documents: spanish-ir/eswiki_20240401_corpus
Queries and qrels in TREC format: spanish-ir/messirve-trec
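
For readers new to the TREC format, here is a minimal Python sketch of how relevance judgments (qrels) in that style are usually read. It assumes the conventional four-column layout (query ID, iteration, document ID, relevance) separated by whitespace. The function name `parse_qrels` and the sample IDs are invented for illustration and are not taken from MessIRve, so check the actual files in spanish-ir/messirve-trec before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse TREC-style qrels lines into {query_id: {doc_id: relevance}}.

    Assumes four whitespace-separated columns per line:
    query_id, iteration (ignored), doc_id, relevance.
    """
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up sample lines for illustration only (not real MessIRve data).
sample = [
    "q1 0 doc_10 1",
    "q1 0 doc_42 0",
    "q2 0 doc_7 1",
]
qrels = parse_qrels(sample)
```

The nested dictionary keyed by query ID and then document ID is a shape that many IR evaluation tools accept, which is why it is used in this sketch.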

For more details, check out our arXiv paper: MessIRve: A Large-Scale Spanish Information Retrieval Dataset

We hope MessIRve serves to spur more work in IR for the Spanish language and facilitate the development of efficient information access tools for Spanish speakers.

MessIRve means “works for me” in Spanish (“me sirve”). The reference to Lionel Messi, a player of football, the most popular sport in Spanish-speaking countries, stresses the importance of using topics that are relevant to Spanish speakers.
