python vectorization text-processing cosine-similarity

Vectorizing a List of Unique Words into 0 or 1 using Python

I am quite new to Python, and recently I have had to do some text processing to compute the cosine similarity between two texts.

So far I have been able to do the basic pre-processing on the text, such as lowercasing, tokenizing, removing stopwords, and stemming the words using the NLTK library. I have now been able to create a list of unique words from all the text files that I have.

Now, in this list of unique words, there are only certain words that I would like to vectorize to 1 (and the rest to 0), according to a text file that I have.

So, for example, after vectorizing the list of unique words, it should look something like this:

awesome| best | carry | elephant | fly | home | irresponsible | implicit 
1      | 1    | 0     | 0        | 0   | 1    | 0             | 0

I have tried googling and looking through Stack Overflow, but it seems one of the common solutions is to use scikit-learn's feature extraction to convert the list. However, I only want either 0 or 1, and the 1s should be specified by a text file.

For example, there is one text file (with all of its words vectorized to 1) whose similarity with this dictionary I would like to compute. It should look something like this:

Text_to_Compare.txt

awesome | fly | implicit
1       | 1   | 1

Then I will compare "Text_to_Compare.txt" with the list of unique words and compute the similarity result.

Could anyone kindly guide me on how to vectorize the list of unique words to only 0 or 1, and how to vectorize "Text_to_Compare.txt" to all 1s?

Thank you!

Solution

Is this what you wanted to do?

text_file = ['hello','world','testing']
term_dict = {'some':0, 'word':0, 'world':0}

for word in text_file:
    if word in term_dict:
        term_dict[word] = 1

If you've tokenized your file (for example with Python's .split() method), the tokens will be available in a list. Assuming you've normalized each term (lowercased, stemmed, stripped of punctuation, etc.) in both your dictionary and your text_file, the above code should work. Set all the values in your dict to 0, then loop over the tokens from your file and check whether each word is in the dict. If it is, set that value to 1.
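If you need the result as an ordered vector rather than a dict (for example, to feed it into a similarity computation), a minimal sketch could look like the following. The names vocabulary, marked_words and binary_vector are made up for illustration, and the words are taken from the example in the question.

```python
# Hypothetical data, taken from the example in the question.
vocabulary = ["awesome", "best", "carry", "elephant",
              "fly", "home", "irresponsible", "implicit"]

# Words that should be vectorized to 1 (e.g. the tokens read from your text file).
marked_words = {"awesome", "best", "home"}

# Keep the vocabulary order fixed so positions line up across vectors.
binary_vector = [1 if word in marked_words else 0 for word in vocabulary]
print(binary_vector)  # [1, 1, 0, 0, 0, 1, 0, 0]
```

Using a set for marked_words keeps each membership check cheap, which matters once the vocabulary grows.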

Here is how you can generate a dictionary with values set to 0:

new_dict = {word:0 for word in text_file}

It's a dictionary comprehension. Note again that my code assumes that you're normalizing all the terms -- comparing apples to apples -- and that's always key when working with text.

Final edit: if you have two lists of unique terms (after tokenizing and normalizing), you can build the dictionary of ones and zeros directly from them:

import string

def normalize(term):
    # Minimal placeholder: lowercase and strip surrounding punctuation.
    # Add your own stemming etc. here.
    return term.lower().strip(string.punctuation)

# Placeholder documents; replace these with the contents of your own files.
text_doc = "Awesome, best home!"
other_text_doc = "awesome best carry elephant fly home irresponsible implicit"

word_set_one = {normalize(word) for word in text_doc.split()}
word_list_two = [normalize(word) for word in other_text_doc.split()]

# word_list_two should be the longer of your two lists (assuming I understand
# your code properly): each of its words maps to 1 if it also occurs in the
# first document, and to 0 otherwise.
word_dict = {word: int(word in word_set_one) for word in word_list_two}
# Note: someone with more python experience could definitely improve my code.  I just wanted to show you another option.
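
Since the end goal in the question is a cosine similarity, here is a rough sketch of how two such 0/1 vectors could be compared. This is only one way to do it: cosine_similarity is a helper written for this example (not a library function), the vectors reuse the example words from the question, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical data, taken from the example in the question.
vocabulary = ["awesome", "best", "carry", "elephant",
              "fly", "home", "irresponsible", "implicit"]
dictionary_vector = [1, 1, 0, 0, 0, 1, 0, 0]

# Words found in Text_to_Compare.txt, projected onto the same vocabulary.
compare_words = {"awesome", "fly", "implicit"}
compare_vector = [1 if word in compare_words else 0 for word in vocabulary]

similarity = cosine_similarity(dictionary_vector, compare_vector)
print(similarity)
```

Both vectors must be built over the same vocabulary in the same order. Here only "awesome" is shared, and each vector has three 1s, so the similarity comes out at about 1/3.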

Please let me know if this works for you. Hope it helps a little!