Where can I find texts that describe topic-specific events?


So, some background: I'm trying to train an ML system to answer questions about events, where both the event descriptions and the questions are posed in natural language; the event descriptions are constrained to single sentences.

So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).

With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more along the lines of chess commentary than a chess forum, for example).

While I've had some luck with a corpus of air-traffic controller dialogs, most of the sentences aren't typical English (they involve a lot of Charlie, Tango, etc.). However, as long as the format is as I've described, the actual topic is irrelevant, so long as the text has one.

Since I plan on building my own corpus out of this text, no tagging is necessary.


Solution

  • The Reuters corpus has fairly monotonous content (commercial news: CEO appointments, mergers and acquisitions, major deals, etc.). I am more familiar with the multilingual v2, but IIRC the v1 corpus was monolingual English. These will be multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/

    You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference
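
    Since the question needs single-sentence event descriptions, here is a minimal sketch of pulling the lead sentence out of each story. It assumes the stories are already loaded as plain strings; the function names and the regex heuristic are illustrative and not part of any corpus toolkit. The splitter is deliberately naive and will break early on abbreviations such as "Mr. Smith", so a proper sentence tokenizer would likely do better on real data.

```python
import re

# Heuristic sentence boundary (an assumption, not a real tokenizer):
# whitespace that follows ., ! or ? and precedes a capital letter or quote.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def first_sentence(story: str) -> str:
    """Return the lead sentence of a story, or '' for an empty story."""
    # Collapse newlines and runs of whitespace into single spaces.
    text = " ".join(story.split())
    if not text:
        return ""
    # Split once at the first boundary and keep the left-hand part.
    return _SENTENCE_END.split(text, maxsplit=1)[0]


def lead_sentences(stories):
    """Collect the lead sentence of every non-empty story."""
    return [s for s in (first_sentence(story) for story in stories) if s]


if __name__ == "__main__":
    # Made-up sample stories, only to show the call pattern.
    samples = [
        "Acme Corp said it will acquire Widget Ltd.\nThe deal closes in June.",
        "Jane Doe was named chief executive. She replaces John Roe.",
    ]
    for sentence in lead_sentences(samples):
        print(sentence)
```

    The resulting list of lead sentences can then be filtered further (by length or vocabulary, say) before building the question-answering corpus.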