I have a huge database of forum data, and I need to extract corpora from it for NLP purposes. The extraction step has parameters (for example FTS queries), and I'd like to save each corpus on the file system together with the parameter metadata.
Some corpora will be dozens of megabytes in size. What is the best way to save a file together with its metadata, so that I can read the metadata without loading the entire file?
I am using the following technologies, which might be relevant: PyQt, Postgres, Python, NLTK.
Some notes:
I guess I could pickle the metadata to a string and have the first line of the file hold it. This seems to be the simplest approach, provided the pickle format is ASCII-safe.
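For illustration, here is a minimal sketch of what I have in mind, using a one-line JSON header instead of pickle to sidestep the ASCII question; the file name and metadata keys are made up:

```python
import json

def save_corpus(path, metadata, text):
    """Write a one-line JSON metadata header, then the corpus text."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=True keeps the header a single ASCII-safe line
        f.write(json.dumps(metadata, ensure_ascii=True) + "\n")
        f.write(text)

def read_metadata(path):
    """Read only the first line, so the (possibly huge) body is never loaded."""
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.readline())

save_corpus("forum_corpus.txt",
            {"fts_query": "python & nltk", "extracted": "2011-05-01"},
            "post one ...\npost two ...\n")
print(read_metadata("forum_corpus.txt"))  # {'fts_query': 'python & nltk', ...}
```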
In NLTK terminology, a "corpus" is the whole collection and can consist of multiple files. It sounds like you can store each forum session (what you call a "corpus") in a separate file, using a structured format that lets you put the metadata at the beginning of the file.
The NLTK generally uses XML for this purpose, but it's not hard to roll your own corpus reader that reads a file header and then defers to PlaintextCorpusReader, or whatever standard reader best fits your file format. If you use XML, you'll also have to extend XMLCorpusReader and provide methods sents(), words(), etc.
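To make that concrete, here is a rough sketch of a header-aware reader. It is not a full NLTK corpus reader subclass, and the class name and file layout are just placeholders; it only shows the idea of reading the header alone for metadata and tokenizing the body on demand (word_tokenize and sent_tokenize require the punkt tokenizer data):

```python
import json
from nltk.tokenize import word_tokenize, sent_tokenize

class HeaderCorpusReader:
    """Toy reader for files that start with a one-line JSON metadata header
    followed by plain text. Not a real NLTK CorpusReader subclass."""

    def __init__(self, fileids):
        self._fileids = list(fileids)

    def metadata(self, fileid):
        # Only the first line is read, so a large corpus body stays on disk.
        with open(fileid, "r", encoding="utf-8") as f:
            return json.loads(f.readline())

    def raw(self, fileid):
        with open(fileid, "r", encoding="utf-8") as f:
            f.readline()  # skip the metadata header
            return f.read()

    def words(self, fileid):
        return word_tokenize(self.raw(fileid))

    def sents(self, fileid):
        return [word_tokenize(s) for s in sent_tokenize(self.raw(fileid))]
```

If the files grow much larger you'd want to stream the body instead of reading it in one go, but for dozens of megabytes this is usually fine.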