
Transforming Text To Vector


I have a dictionary containing words and the frequency of each word.

{'cxampphtdocsemployeesphp': 1,
'emptiness': 1, 
'encodingundefinedconversionerror': 1, 
'msbuildexe': 2,
'e5': 1, 
'lnk4049': 1,
'specifierqualifierlist': 2, .... }

Now I want to create a bag-of-words model using this dictionary. (I don't want to use a standard library function; I want to implement the algorithm myself.)

  1. Find the N most popular words in the dictionary and enumerate them. Now we have a dictionary of the most popular words.
  2. For each title in the corpora, create a zero vector with dimension equal to N.
  3. For each text in the corpora, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.
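Step 1 can be done without any special libraries. A minimal sketch, assuming the frequency dictionary from above is named `word_freq` (that name, and the shortened example data, are mine):

```python
# Sketch of step 1: pick the N most frequent words and enumerate them.
# `word_freq` is the word -> frequency dictionary from the question.
word_freq = {'msbuildexe': 2, 'specifierqualifierlist': 2,
             'emptiness': 1, 'e5': 1, 'lnk4049': 1}

N = 3
# Sort by frequency (descending) and keep the first N words.
most_popular = sorted(word_freq, key=word_freq.get, reverse=True)[:N]
# Enumerate them: word -> position in the vector.
words_to_index = {word: i for i, word in enumerate(most_popular)}
print(words_to_index)
```

Python's `sorted` is stable, so words with equal frequency keep their original dictionary order.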

I have my text which I will use to create the vector using a function.

The function would look like this,

def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    words_to_index: a dictionary mapping each popular word to its index
    dict_size: size of the dictionary

    return a vector which is a bag-of-words representation of 'text'
    """


Let's say we have N = 4 and the list of the most popular words is

['hi', 'you', 'me', 'are']

Then we need to enumerate them, for example, like this:

{'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text, which we want to transform to the vector:
'hi how are you'

For this text we create a corresponding zero vector 
[0, 0, 0, 0]

Then we iterate over all the words, and if a word is in the dictionary, we increase the value at the corresponding position in the vector:
'hi':  [1, 0, 0, 0]
'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
'are': [1, 0, 0, 1]
'you': [1, 1, 0, 1]

The resulting vector will be 
[1, 1, 0, 1]
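Following the walkthrough above, one possible implementation of `my_bag_of_words` (a sketch, not the only way to write it):

```python
def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    words_to_index: a dictionary mapping each popular word to its index
    dict_size: size of the dictionary (N)

    return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = [0] * dict_size          # step 2: zero vector of length N
    for word in text.split():                # step 3: iterate over the words
        if word in words_to_index:           # skip words outside the dictionary
            result_vector[words_to_index[word]] += 1
    return result_vector

words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
print(my_bag_of_words('hi how are you', words_to_index, 4))  # [1, 1, 0, 1]
```

Note that `text.split()` is a deliberately simple tokenizer; real text may need lowercasing and punctuation handling first.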

Any help implementing this would be appreciated. I am using Python.

Thanks,

Neel


Solution

  • You first need to calculate the corpus frequency of each term (in your case, each word) and keep it in a frequency dictionary. Let's say cherry happens to occur 78 times in your corpus: you need to keep cherry --> 78. Then sort your frequency dictionary in descending order by frequency and keep the first N pairs.
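To illustrate this first step, a sketch using `collections.Counter` on a hypothetical list of raw documents (the `corpus` data here is my own toy example):

```python
from collections import Counter

corpus = ['cherry pie with cherry on top',
          'apple pie',
          'cherry cola']

# Corpus frequency: how many times each term occurs across all documents.
freq = Counter()
for doc in corpus:
    freq.update(doc.split())

# Keep the first N pairs, sorted descending by frequency.
N = 2
top_n = freq.most_common(N)
print(top_n)  # [('cherry', 3), ('pie', 2)]
```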

    Then, for your enumeration, you can keep a dictionary as an index. For instance, cherry --> term2 in the index dictionary.

    Now an incidence matrix needs to be prepared. It will contain the vectors of the documents, like this:

    doc_id   term1 term2 term3 .... termN
    doc1       35     0    23         1
    doc2        0     0    13         2
       .        .     .     .         .
    docM        3     1     2         0
    

    Each document (text, title, sentence) in your corpus needs an id or index, as listed above. Now it is time to create a vector for each document. Iterate through your documents and tokenize them, so you have tokens per document. Then iterate through the tokens and check whether each token exists in your frequency dictionary. If it does, update the document's zero vector using your index dictionary and frequency dictionary.

    Let's say doc5 contains cherry, and cherry is among our N most popular terms. Get its frequency (it was 78) and its index (it was term2). Now update the zero vector of doc5:

    doc_id   term1 term2 term3 .... termN
    doc1       35     0    23         1
    doc2        0     0    13         2
       .        .     .     .         .
    doc5        0    78     0         0 (under process)
    

    You need to do this for each token against all popular terms for every document in your corpus.

    At the end you will have an MxN matrix, which contains the vectors of the M documents in your corpus.
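    The whole procedure can be sketched end to end (the toy corpus and all names are mine; this version follows the question's step 3 and increments a cell by 1 per occurrence in the document, rather than storing the corpus-wide frequency as the worked doc5 example does):

    ```python
    from collections import Counter

    corpus = ['cherry pie with cherry on top',
              'apple pie',
              'cherry cola']

    # Corpus frequencies, then the N most popular terms and their indices.
    freq = Counter(word for doc in corpus for word in doc.split())
    N = 3
    index = {term: i for i, (term, _) in enumerate(freq.most_common(N))}

    # One zero vector per document, incremented per token found in the index.
    matrix = []
    for doc in corpus:
        vector = [0] * N
        for token in doc.split():
            if token in index:
                vector[index[token]] += 1
        matrix.append(vector)

    for row in matrix:
        print(row)
    ```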

    I suggest having a look at the IR-Book: https://nlp.stanford.edu/IR-book/information-retrieval-book.html

    You might also consider using a tf-idf-based matrix, as the book proposes, instead of a corpus-frequency-based term incidence matrix.
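    As a rough sketch of that tf-idf alternative, using the common tf × log(M/df) weighting (several variants exist; the toy corpus and term list are mine):

    ```python
    import math

    corpus = ['cherry pie with cherry on top',
              'apple pie',
              'cherry cola']
    M = len(corpus)
    docs = [doc.split() for doc in corpus]

    terms = ['cherry', 'pie', 'with']  # assume these are the N popular terms

    # df: number of documents containing each term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}

    # tf-idf: raw term count in the document times log(M / df).
    tfidf = [[d.count(t) * math.log(M / df[t]) for t in terms] for d in docs]
    print(tfidf[0])
    ```

    This down-weights terms that appear in many documents, unlike the raw frequency matrix.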

    Hope this post helps,

    Cheers