python information-retrieval inverted-index

Storing an inverted index

I am working on an information retrieval project. I have built a full inverted index using Hadoop and Python. Hadoop outputs the index as (word, documentlist) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that it still has quick access times? At present I am storing the dictionary using the Python pickle module and loading from it, but that brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.

My dictionary structure is as follows (using nested dictionaries):

{word : {doc1:[locations], doc2:[locations], ....}}

so that I can get the documents containing a word by dictionary[word].keys() ... and so on.
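
For illustration, here is a toy index in that layout. The words, document names, and positions are made up; only the nesting matches my real data.

```python
# Toy inverted index: {word: {document: [positions of the word in that document]}}
index = {
    "hadoop": {"doc1": [3, 17], "doc2": [5]},
    "python": {"doc2": [1, 9]},
}

# Documents containing a word
docs = sorted(index["hadoop"].keys())

# Positions of a word inside one document
positions = index["hadoop"]["doc1"]
```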

Solution

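Use shelve from the standard library. A shelf behaves like a dictionary that lives on disk, and a lookup reads only the value stored under the requested key.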
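
A minimal sketch of how that could look, assuming one shelf entry per word and string keys (shelve requires string keys). The toy index, the temporary path, and the docs_containing helper are hypothetical names for illustration, not part of your code.

```python
import os
import shelve
import tempfile

# Toy index in the question's layout: {word: {doc: [locations]}}
index = {
    "hadoop": {"doc1": [3, 17], "doc2": [5]},
    "python": {"doc2": [1, 9], "doc3": [4]},
}

# Hypothetical location for the on-disk index (a temp dir, for the example only).
path = os.path.join(tempfile.mkdtemp(), "inverted_index")

# Build step: store each word's postings as its own entry.
# shelve pickles every value separately, so entries can be read independently.
with shelve.open(path) as shelf:
    for word, postings in index.items():
        shelf[word] = postings


def docs_containing(word, shelf_path=path):
    """Return the sorted document ids for a word, reading only that word's entry."""
    with shelve.open(shelf_path, flag="r") as shelf:
        postings = shelf.get(word, {})
    return sorted(postings)


python_docs = docs_containing("python")
missing_docs = docs_containing("cobol")
```

In a real search loop you would probably open the shelf once and keep it open rather than reopening it per query. Also note that changing a nested list in place is not saved unless you assign the whole value back to the shelf (or open it with writeback=True).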

At present I am storing the dictionary using the Python pickle module and loading from it, but that brings the whole index into memory at once (or does it?).

Yes, it does. pickle.load rebuilds the entire dictionary in memory.

Is that a problem? If it's not an actual problem, then stick with it.

If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?