I wrote a piece of code, but I am not sure if we can get rid of the loops and vectorize it to make it faster. Can you please give suggestions? I am just updating the co-occurrence matrix.
import numpy as np

M = np.zeros((num_words, num_words))
word2Ind = {words[i]: i for i in range(len(words))}
for document in corpus:
    for i, word in enumerate(document):
        for j in range(i - window_size, i + window_size + 1):
            if i != j and 0 <= j < len(document):
                M[word2Ind[document[i]], word2Ind[document[j]]] += 1
Since the only place you use word2Ind is in lookups like word2Ind[document[i]], you could at least start by computing the indices for each document once and for all, and then work from those indices:
M = np.zeros((num_words, num_words))
word2Ind = {words[i]: i for i in range(len(words))}
for document in corpus:
    IX = [word2Ind[d] for d in document]
    for i, word in enumerate(document):
        for j in range(i - window_size, i + window_size + 1):
            if i != j and 0 <= j < len(document):
                M[IX[i], IX[j]] += 1
It then becomes easier to slightly vectorize:
M = np.zeros((num_words, num_words))
word2Ind = {words[i]: i for i in range(len(words))}
for document in corpus:
    IX = np.array([word2Ind[d] for d in document], dtype=np.uint32)
    for offset in range(1, window_size + 1):
        # Note: fancy-indexed `M[...] += 1` would increment a repeated
        # (row, col) pair only once, so np.add.at is needed to accumulate
        # every occurrence of each pair.
        np.add.at(M, (IX[:-offset], IX[offset:]), 1)
        np.add.at(M, (IX[offset:], IX[:-offset]), 1)
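As a quick sanity check on toy data (the corpus, vocabulary, and window size below are made up for illustration), the vectorized version should produce exactly the same matrix as the original triple loop:

```python
import numpy as np

corpus = [["a", "b", "a", "c"], ["b", "c", "c"]]  # toy corpus
words = ["a", "b", "c"]
num_words = len(words)
window_size = 2
word2Ind = {w: i for i, w in enumerate(words)}

# Original triple-loop version.
M_loop = np.zeros((num_words, num_words))
for document in corpus:
    for i in range(len(document)):
        for j in range(i - window_size, i + window_size + 1):
            if i != j and 0 <= j < len(document):
                M_loop[word2Ind[document[i]], word2Ind[document[j]]] += 1

# Vectorized version: one pass per window offset, np.add.at so that
# repeated index pairs are each counted.
M_vec = np.zeros((num_words, num_words))
for document in corpus:
    IX = np.array([word2Ind[d] for d in document], dtype=np.uint32)
    for offset in range(1, window_size + 1):
        np.add.at(M_vec, (IX[:-offset], IX[offset:]), 1)
        np.add.at(M_vec, (IX[offset:], IX[:-offset]), 1)

print(np.array_equal(M_loop, M_vec))  # → True
```

With `np.add.at` the inner work per document is O(window_size) NumPy calls instead of an O(len(document) * window_size) Python loop, which is where the speedup comes from.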