I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file together with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the txt file, finds the word pairs and counts how many times each pair was seen in the corpus. I have managed to write the code below:
tuples = [[]] # array for word tuples to be stored in
data = [] # array for tuple frequencies to be stored in
data.append(0) # the tuples array starts with an empty element at the beginning, so adding a zero
               # to the beginning of the frequency array keeps the indexes of the two arrays aligned

with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]

lineIndex = 0
for line in contentData:
    tmpArray = line.split() # split line into an array of words
    for tmpArrayIndex in range(len(tmpArray) - 1): # do this for every word except the last one, since the last word has no word after it
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples: # if the word pair was seen before,
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1 # increment the frequency of said pair
        else: # if the word pair was never seen before,
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]]) # add the pair to the list
            data.append(1) # and set its frequency to 1
    # print every 1000th line number to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write the tuples to the txt file
    for pair in tuples:
        if (len(pair) > 0): # if the tuple is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n") # blank lines between the two data blocks
    # write the frequencies of the tuples to the txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
This code seems to work well for the first couple of thousand lines. Then things start to slow down because the tuple list keeps getting bigger and I have to search the whole list to check whether the next word pair has been seen before. I managed to get the data of 50k lines in 30 minutes, but I have much bigger corpora with millions of lines. Is there a way to store and search for the word pairs more efficiently? Matrices would probably work a lot faster, but my unique word count is about 300,000 words, which means I would have to create a 300k*300k matrix of integers. Even after taking advantage of symmetric matrices, it would require a lot more memory than I have.
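For context, a quick back-of-the-envelope check of why the dense matrix does not fit (the 4-byte and 8-byte element sizes are assumptions about the integer dtype):

vocab = 300_000
entries = vocab * vocab      # 9 * 10**10 cells
print(entries * 4 / 1e9)     # ~360 GB with 4-byte integers
print(entries * 8 / 1e9)     # ~720 GB with 8-byte integers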
I tried using memmap from numpy to store the matrix on disk rather than in memory, but it required about 500 GB of free disk space.
Then I studied sparse matrices and found out that I can just store the non-zero values together with their corresponding row and column numbers, which is what I did in my code.
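For anyone curious, a minimal sketch of what such a sparse-matrix version could look like, using scipy's dok_matrix. It assumes the file already contains integer word IDs in the range 0 to VOCAB_SIZE - 1; the file name and vocabulary size are taken from the question, the variable names are made up for illustration:

import scipy.sparse as sp

VOCAB_SIZE = 300000  # assumed number of unique word IDs

# dictionary-of-keys sparse matrix: only the non-zero bigram counts are stored
counts = sp.dok_matrix((VOCAB_SIZE, VOCAB_SIZE), dtype=int)

with open("markovData.txt") as f:
    for line in f:
        ids = [int(token) for token in line.split()]  # one sentence as a list of word IDs
        for i in range(len(ids) - 1):
            counts[ids[i], ids[i + 1]] += 1  # row = current word ID, column = next word ID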
Right now this model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train with bigger corpora to get better results. What can I do to make this word-pair counting code more efficient?
Thanks.
Edit: Thanks to everyone who answered, I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:
import numpy as np
import time

start = time.time()
myDict = {} # empty dict

with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]

lineIndex = 0
for line in contentData:
    tmpArray = line.split() # split line into an array of words
    for tmpArrayIndex in range(len(tmpArray) - 1): # do this for every word except the last one, since the last word has no word after it
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict: # if the word pair was seen before,
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1 # increment the frequency of said pair
        else: # if the word pair was never seen before,
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1 # add the pair to the dict and set its frequency to 1
    # print every 1000th line number to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)

end = time.time()
print(end - start)

keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")

with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
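Not part of the question as such, but here is a rough sketch of how the collected counts can then be used to guess the next word, simply by taking the most frequent successor (predict_next and the example word ID "42" are hypothetical names used for illustration, not part of the code above):

def predict_next(word, pair_counts):
    # collect all (successor, count) pairs whose first word matches the given word
    candidates = [(w2, count) for (w1, w2), count in pair_counts.items() if w1 == word]
    if not candidates:
        return None # the word was never seen as the first element of a pair
    # return the successor with the highest observed frequency
    return max(candidates, key=lambda item: item[1])[0]

print(predict_next("42", myDict)) # "42" is a hypothetical word ID

Note that this scans the whole dictionary for every prediction; the nested-dictionary layout suggested in the answer below turns that lookup into a direct indexing operation instead.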
As I understand you, you are trying to build a Markov model, using frequencies of n-grams (word tuples of length n). Maybe just try out a more efficiently searchable data structure, for example a nested dictionary. It could be of the form
{ID_word1: {ID_word1: x1, ..., ID_wordk: y1}, ..., ID_wordk: {ID_word1: xn, ..., ID_wordk: yn}}.
This would mean that you have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 words for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should boost your performance, since you no longer have to search a growing list of tuples. x and y stand for the occurrence counts, which you increment whenever you encounter a tuple. (Don't use the built-in function count() for this!)
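A minimal sketch of that nested structure, assuming the same input file as in the question and using collections.defaultdict with Counter so that missing keys are created on the fly:

from collections import defaultdict, Counter

# outer key: ID of the first word; inner Counter maps ID of the following word -> count
bigrams = defaultdict(Counter)

with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]): # consecutive word-ID pairs
            bigrams[first][second] += 1

# the most likely successor of a word is then a direct lookup, e.g.:
# bigrams["42"].most_common(1)   # "42" is a hypothetical word ID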