I have a huge corpus of data in a text file that I want to use to train a skip-gram model. I have split the file's contents into a list of words. Now I want to count how often each word occurs and build a dictionary with the word as the key and its frequency as the value. Here is a snippet of my code:
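import collections

with open("enwik8", "r") as data:
    words = data.read().split()  # whitespace-separated tokens
vocabulary_size = 5000
count = [['UNK', -1]]  # placeholder entry for out-of-vocabulary words
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))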
I have successfully built a list of the most common words with their frequencies (up to vocabulary_size - 1 of them). Now I need to put them into a dictionary, with the word as the key and its frequency as the value.
dictionary = dict()
for word, _ in count:
Can anyone help me with this?
Assuming you already have a list of (word, count) pairs, which is the list named count in your code, here is how to build the dictionary you need:
word_dict = dict()
for word_count in count:
    # keep the first frequency seen for each word
    if word_count[0] not in word_dict:
        word_dict[word_count[0]] = word_count[1]
Your list contains tuples, so in word_dict[word_count[0]] I use the first item of the tuple, the word, as the dictionary key, and the second item, word_count[1], the count, as the value for that key.
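For reference, here is a minimal self-contained sketch of the same idea that you can run without the enwik8 file. The sample sentence and the tiny vocabulary_size are made up for illustration; they are not from your data. It also shows that dict() can consume the list of pairs directly, which should give the same result as the explicit loop as long as no word appears twice in the list.

```python
import collections

# Hypothetical stand-in for data.read().split() on the real corpus.
words = "the cat sat on the mat the cat ran".split()
vocabulary_size = 3  # tiny illustrative value, not the one from the question

# Same shape as the question's list: an 'UNK' placeholder followed by
# the (vocabulary_size - 1) most common (word, frequency) pairs.
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))

# dict() accepts any iterable of two-item sequences, so the pairs
# become key/value entries directly.
word_dict = dict(count)

# The same mapping written as a dict comprehension.
word_dict_alt = {word: freq for word, freq in count}

print(word_dict)
```

If you only need word frequencies and not the 'UNK' entry, note that collections.Counter is itself a dict subclass, so Counter(words) can be used as a word-to-frequency dictionary without any conversion.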