I am quite new to Python, and recently I have had to do some text processing to compute the cosine similarity between two texts.
So far I have been able to do the basic pre-processing of the text, such as lowercasing, tokenizing, removing stopwords, and stemming the words, using the NLTK library. I have also been able to create a list of unique words from all the text files that I have.
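Roughly, my pre-processing looks something like the sketch below (simplified -- the exact tokenizer, stopword list, and stemmer I use may differ, and all_texts is just a placeholder for the contents of my files):

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # lowercase, tokenize, drop stopwords/punctuation, then stem
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens
            if t not in stop_words and t not in string.punctuation]

# unique words across all documents (all_texts stands in for my files' contents)
unique_words = sorted({tok for doc in all_texts for tok in preprocess(doc)})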
Now, from this list of unique words, there are only certain words that I would like to vectorize to 1 (and the rest to 0), according to a text file that I have.
So, for example, after vectorizing the list of unique words, it should look something like this:
awesome | best | carry | elephant | fly | home | irresponsible | implicit
1       | 1    | 0     | 0        | 0   | 1    | 0             | 0
I have tried Googling and looking through Stack Overflow, and it seems one of the common solutions is to use scikit-learn's feature extraction to convert the list. However, I only want values of either 0 or 1, and the 1s should be determined by a text file.
For example, there is one text file that, after being vectorized to all 1s, I would like to compare against this dictionary to compute the similarity. It should look something like this:
Text_to_Compare.txt
awesome | fly | implicit
1       | 1   | 1
Then I will compare "Text_to_Compare.txt" against the list of unique words and compute the similarity result.
Could anyone kindly guide me on how to vectorize the list of unique words to 0s and 1s, and how to vectorize "Text_to_Compare.txt" to all 1s?
Thank you!
Is this what you wanted to do?
text_file = ['hello', 'world', 'testing']
term_dict = {'some': 0, 'word': 0, 'world': 0}

for word in text_file:
    if word in term_dict:
        term_dict[word] = 1  # term_dict is now {'some': 0, 'word': 0, 'world': 1}
If you've tokenized your file (the .split() method in Python), then the tokens will be available in a list. Assuming that you've normalized each term (lowercased, stemmed, stripped of punctuation, etc.) in both your dictionary and your text_file, the above code should work. Just set all the values in your dict to 0, loop over your file, and check whether each word is in the dict. If it is, set that value to 1.
Here is how you can generate a dictionary with values set to 0:
new_dict = {word:0 for word in text_file}
It's a dictionary comprehension. Note again that my code assumes that you're normalizing all the terms -- comparing apples to apples -- and that's always key when working with text.
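For instance, applied to your example (a rough sketch -- I'm assuming your unique words are already in a list called unique_words and that Text_to_Compare.txt contains whitespace-separated, already-normalized terms, which may not match your actual file format):

unique_words = ['awesome', 'best', 'carry', 'elephant', 'fly', 'home', 'irresponsible', 'implicit']

vector = {word: 0 for word in unique_words}   # everything starts at 0
with open('Text_to_Compare.txt') as f:
    for word in f.read().split():             # terms from the comparison file
        if word in vector:
            vector[word] = 1                  # mark the terms that appear in the file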
Final Edit. If you have two lists of unique terms (after tokenizing and normalizing):
import string

def normalize(term):
    # do stuff -- i.e., lower; stem; strip punctuation; etc.
    return term.lower().strip(string.punctuation)

word_list_one = [normalize(word) for word in text_doc.split()]
word_list_two = [normalize(word) for word in other_text_doc.split()]

# if you know the longer of your two lists, you can create a dictionary of ones and zeros from them
word_dict = {word: 1 if word in word_list_one else 0 for word in word_list_two}
# that's it. in the above code, word_list_two should be the longer of your two lists (assuming I understand your setup properly)
# Note: someone with more Python experience could definitely improve my code. I just wanted to show you another option.
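And since your end goal is cosine similarity: once both texts are represented as 0/1 dictionaries over the same vocabulary, a minimal sketch in plain Python (no scikit-learn needed; the function name and the vector_one/vector_two names are just my assumptions) could be:

import math

def cosine_similarity(vec_a, vec_b):
    # vec_a and vec_b map each vocabulary word to 0 or 1
    dot = sum(vec_a[w] * vec_b.get(w, 0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0            # avoid dividing by zero for empty vectors
    return dot / (norm_a * norm_b)

# e.g. similarity between your unique-word vector and the Text_to_Compare.txt vector
# score = cosine_similarity(vector_one, vector_two)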
Please let me know if this works for you. Hope it helps a little!