python data-structures information-retrieval inverted-index

Inverted Index where I can save a tuple of the word along with an id of where it came from


I have created the following class to implement an inverted index in Python. I read questions from the Quora Question Pairs challenge. The questions have this form:

---------------------------
qid  |question         
---------------------------
  1  |Why do we exist?
  2  |Is there life on Mars?
  3  |What happens after death?
  4  |Why are bananas yellow?

The problem is that I want the qid to be passed along with each word into the inverted index, so that once the index is built I know which question each word came from and can access it easily.

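from collections import defaultdict


class Index:
    """ Inverted index datastructure """

    def __init__(self, stemmer=None):
        self.index = defaultdict(list)   # word -> list of internal document ids
        self.documents = {}              # internal document id -> processed words
        self.stemmer = stemmer           # optional; lookup() checks it before stemming
        self.__unique_id = 0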


    def lookup(self, word):
        """
        Lookup a word in the index
        """
        word = word.lower()
        if self.stemmer:
            word = self.stemmer.stem(word)

        # Default to an empty list so an unknown word does not raise a TypeError
        return [self.documents.get(doc_id) for doc_id in self.index.get(word, [])]


    def addProcessed(self, words):
        """
        Add a processed document (a list of words) to the index
        """
        for word in words:
            if self.__unique_id not in self.index[word]:
                self.index[word].append(self.__unique_id)

        self.documents[self.__unique_id] = words
        self.__unique_id += 1

How could I implement this in my above data structure?


Solution

  • A straightforward way to get the qid into your index is to have Index.addProcessed take qid as a second argument and store it, together with the words, as the value under the __unique_id key in self.documents.

    def addProcessed(self, words, qid):
        #...
        self.documents[self.__unique_id] = (words, qid)
        self.__unique_id += 1
    

    Index.lookup will then return a list of (words, qid) tuples, one for each indexed question that contains the word.
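
For illustration, here is a minimal end-to-end sketch of that idea using the four sample questions from the table. It is not the asker's exact class: the stemmer is left out, the private counter is written as _unique_id, and the tokenization (lowercase, drop "?", split on whitespace) is only an assumption made so the example runs on its own.

```python
from collections import defaultdict


class Index:
    """Inverted index that remembers the qid of each indexed question."""

    def __init__(self):
        self.index = defaultdict(list)  # word -> internal document ids
        self.documents = {}             # internal document id -> (words, qid)
        self._unique_id = 0

    def addProcessed(self, words, qid):
        """Index an already tokenized question and keep its qid."""
        for word in words:
            if self._unique_id not in self.index[word]:
                self.index[word].append(self._unique_id)
        self.documents[self._unique_id] = (words, qid)
        self._unique_id += 1

    def lookup(self, word):
        """Return the (words, qid) tuples of all questions containing word."""
        # .get(word, []) avoids creating an empty entry for unknown words
        doc_ids = self.index.get(word.lower(), [])
        return [self.documents[doc_id] for doc_id in doc_ids]


# Sample rows taken from the table in the question
questions = {
    1: "Why do we exist?",
    2: "Is there life on Mars?",
    3: "What happens after death?",
    4: "Why are bananas yellow?",
}

index = Index()
for qid, text in questions.items():
    # Naive tokenization (assumption): lowercase, drop "?", split on whitespace
    index.addProcessed(text.lower().replace("?", "").split(), qid)

hits = index.lookup("why")
qids = [qid for _words, qid in hits]
print(qids)  # [1, 4]
```

Because each stored value is a (words, qid) tuple, the qid can be unpacked directly from every lookup result, as the last lines show.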