Tags: python-3.x, nlp, countvectorizer, tfidfvectorizer

Better way to identify words that are unique to each document in a corpus


I have created a small test corpus:

words = ["he she why fun", "you are why it", "believe it or stop", 'hello goodbye it', 'i goodbye']
print(len(words))

I am trying to create a dictionary whose keys are the words that appear in only one document and whose values are the index of the document they came from. So I wrote this routine:

dc = {}
count = 0
while count < len(words):
    for word in words[count].split():
        # p is every document except the current one, joined together
        p = " ".join(words[:count]) + " " + " ".join(words[count + 1:])
        if word not in p.split():
            dc[word] = count
    count += 1

print(dc)



{'he': 0, 'she': 0, 'fun': 0, 'you': 1, 'are': 1, 'believe': 2, 'or': 2, 'stop': 2, 'hello': 3, 'i': 4}

This works, but it's clunky. Is there a way to do this with a CountVectorizer, TF-IDF, or perhaps a spaCy function? I'm also concerned about readability: the flat dictionary format doesn't look very good.


Solution

  • You can simplify this to a single pass by collecting every word you see into a set, and removing from the result any word that turns up a second time.

    dc = dict()
    seen = set()
    for index, sentence in enumerate(words):
        # set() so that a word repeated within one sentence is not
        # mistaken for a word that appears in two documents
        for word in set(sentence.split()):
            if word in seen:
                # second sighting across documents: not unique after all
                if word in dc:
                    del dc[word]
            else:
                seen.add(word)
                dc[word] = index

    print(dc)
    

    I suppose you could try to conflate the set with the dict (a sketch of that follows below), but I think having two separate variables is cleaner and probably more efficient for nontrivial amounts of data.

    Notice also the use of enumerate to keep track of where you are in a loop over items.
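
    For completeness, here is a sketch of the conflated version mentioned above (my own variation, not taken from any library): one dict where a value of None marks a word that has been seen in more than one document.

    dc = {}
    for index, sentence in enumerate(words):
        for word in set(sentence.split()):
            # None marks a word already claimed by another document
            dc[word] = None if word in dc else index
    dc = {word: index for word, index in dc.items() if index is not None}
    print(dc)

    It produces the same mapping, but the sentinel value makes the intent harder to read, which is why the two-variable version is preferable.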
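
    On the readability concern: one option, assuming you keep the flat dc mapping built above, is to invert it so that each document index lists its unique words.

    from collections import defaultdict

    by_doc = defaultdict(list)
    for word, index in dc.items():
        by_doc[index].append(word)

    for index, unique_words in sorted(by_doc.items()):
        print(index, unique_words)

    This prints one line per document, e.g. 0 ['he', 'she', 'fun'].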
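
    As for CountVectorizer: a word is unique to one document exactly when its document frequency is 1, so you can read the same mapping off a binary term-document matrix. A minimal sketch, assuming scikit-learn is installed (the token_pattern is needed because the default tokenizer drops one-letter tokens like "i"; get_feature_names_out requires scikit-learn 1.0+):

    from sklearn.feature_extraction.text import CountVectorizer

    # binary=True: we only care whether a word occurs in a document
    vectorizer = CountVectorizer(binary=True, token_pattern=r"\S+")
    X = vectorizer.fit_transform(words)   # documents x vocabulary, 0/1 entries
    doc_freq = X.sum(axis=0).A1           # number of documents containing each word

    dc = {word: int(X[:, i].argmax())     # the single document containing it
          for i, word in enumerate(vectorizer.get_feature_names_out())
          if doc_freq[i] == 1}
    print(dc)

    This yields the same word-to-document mapping (keyed alphabetically), though for a corpus this small the plain Python loop is simpler.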