I have created a small test corpus:
words = ["he she why fun", "you are why it", "believe it or stop", "hello goodbye it", "i goodbye"]
print(len(words))
I am trying to create a dictionary whose keys are the words that appear in exactly one document, and whose values are the index of the document each word came from. So I wrote this routine:
dc = {}
count = 0
while count < len(words):
    for word in words[count].split():
        # p is every document except the current one, joined together
        p = " ".join(words[0:count]) + " " + " ".join(words[count+1:len(words)])
        if word not in p.split():
            dc[word] = count
    count += 1
print(dc)
{'he': 0, 'she': 0, 'fun': 0, 'you': 1, 'are': 1, 'believe': 2, 'or': 2, 'stop': 2, 'hello': 3, 'i': 4}
This works, but it's clunky. Is there a way to do this with a CountVectorizer, TF-IDF, or perhaps a spaCy function? I'm also concerned about readability, i.e. the raw dictionary output doesn't look very good.
You can simplify this with a single pass: record each word in a set the first time you see it, and delete it from the result dict if it shows up again.
dc = dict()
seen = set()
for index, sentence in enumerate(words):
    for word in sentence.split():
        if word in seen:
            # Word occurs more than once; remove it if we kept it earlier
            if word in dc:
                del dc[word]
        else:
            seen.add(word)
            dc[word] = index
print(dc)
I suppose you could try to conflate the set with the dict, but keeping two separate variables is cleaner and probably more efficient for nontrivial amounts of data.
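If you prefer a single structure, one alternative sketch uses `collections.Counter` to count how often each word occurs across the whole corpus first, then keeps only the singletons. (This is a variant of your approach, not a vectorizer-based one; `words` is assumed to be your corpus from above.)

```python
from collections import Counter

words = ["he she why fun", "you are why it", "believe it or stop",
         "hello goodbye it", "i goodbye"]

# Count every word occurrence across the whole corpus.
counts = Counter(word for sentence in words for word in sentence.split())

# Keep only words that occur exactly once, mapped to their document index.
dc = {word: index
      for index, sentence in enumerate(words)
      for word in sentence.split()
      if counts[word] == 1}

print(dc)
```

This produces the same result as the set-based version for your sample data, at the cost of iterating over the corpus twice.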
Notice also the use of enumerate to keep track of where you are in a loop over items.