machine-learning, dataset, information-retrieval, supervised-learning

(Query, Document, Relevance) free dataset for building an information retrieval system


I'm interested in finding a dataset like the "English Relevance Judgements File List": http://trec.nist.gov/data/qrels_eng

This dataset contains labelled pairs of queries and documents. However, it depends on a non-free corpus called "Data - English Documents": http://trec.nist.gov/data/docs_eng.html

Do you know of any free dataset(s) similar to this one?

Side-note: The dataset will be used in a research project for building an information retrieval system based on neural networks.


Solution

  • You have confused several TREC collections in your question. ClueWeb09 and the document sets pointed to by trec.nist.gov/data/docs_eng.html are all separate document sets. That is, each document set has its own distinct topics (queries) and relevance judgments, which are not part of the document set distribution.

    There are dozens of different TREC text retrieval test collections. The available collections are listed on the TREC Data page (trec.nist.gov/data.html), organized by the TREC track in which they were created, because each collection is generally targeted at the retrieval problem that its track was designed to address.

    In general, the queries and relevance judgments can be downloaded directly from the TREC site. The document sets usually must be purchased: the document sets are either copyrighted by the original source and must be licensed or there is other significant expense associated with collecting/distributing the document set. Some of the old TREC document sets you can obtain for free if you participate in TREC (though that is not an option any more for this year). A few document sets are free, though most still require a Data Use agreement to be signed. The Genomics track had an ad hoc search task and its document set is free subject to a Data Use agreement. See http://trec.nist.gov/data/genomics.html .

    The University of Glasgow maintains a page that points to other available test collections, some of which are free, at http://ir.dcs.gla.ac.uk/resources/test_collections/ . Most of these are pre-TREC (pre-1992) collections, which are very tiny by today's standards. ("Tiny" as in you will probably find paper reviewers highly skeptical of results demonstrated only on small collections.)

    Ellen Voorhees, TREC project manager, NIST
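
As a rough illustration of the (query, document, relevance) triples the question asks about, here is a minimal sketch of reading TREC-style relevance judgments. It assumes the common four-column qrels layout (topic, iteration, document number, relevance) separated by whitespace. The names `Judgment` and `parse_qrels` and the sample lines are made up for this example and do not come from any TREC distribution; individual tracks may use different layouts, so check the documentation of the collection you download.

```python
from typing import Iterable, Iterator, NamedTuple


class Judgment(NamedTuple):
    """One (query, document, relevance) triple. Hypothetical helper type."""
    query_id: str
    doc_id: str
    relevance: int


def parse_qrels(lines: Iterable[str]) -> Iterator[Judgment]:
    """Yield judgments from lines assumed to look like: topic iteration docno relevance."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, rel = fields
        yield Judgment(topic, docno, int(rel))


# Illustrative lines only; these are not real TREC judgments.
sample = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "",
    "302 0 DOC-C 1",
]

judgments = list(parse_qrels(sample))
relevant = [j for j in judgments if j.relevance > 0]
```

For a real file, the same function could be fed an open file handle, for example `parse_qrels(open(path))`. Note that the triples carry only identifiers: the query text comes from the separate topics file, and the document text comes from the document set, which is the part that usually has to be licensed.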