I'm working on an NLP task and I need to calculate the co-occurrence matrix over documents. The basic formulation is as below:
Here I have a matrix with shape (n, length), where each row represents a sentence composed of length words. So there are n sentences in all, each of the same length. Then, with a defined context size, e.g., window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the cth row and wth column is #(w,c), which means the number of times that a context word c appears in w's context.
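To make the definition of #(w,c) concrete, here is a minimal sketch that counts it directly on a toy sentence. It assumes that the window size means the number of positions on each side of w (the question does not say which convention is intended) and that a word is not counted as its own context. The name window_cooccurrence is made up for this illustration.

```python
from collections import Counter

def window_cooccurrence(sentences, window_size):
    """Count #(w, c): how often c appears within window_size positions of w.

    Assumption: window_size is the number of positions on EACH side of w,
    and a position is never counted as its own context.
    """
    counts = Counter()
    for sentence in sentences:
        for i, w in enumerate(sentence):
            lo = max(i - window_size, 0)
            hi = min(i + window_size + 1, len(sentence))
            for j in range(lo, hi):
                if j != i:
                    counts[(w, sentence[j])] += 1
    return counts

# Toy input: one sentence of four words
sentences = [["a", "b", "c", "a"]]
counts = window_cooccurrence(sentences, window_size=1)
print(counts[("b", "c")])  # "c" is directly next to "b" once
```

With a symmetric window like this one, #(w,c) equals #(c,w), so the resulting matrix D is symmetric.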
For an example, see: How to calculate the co-occurrence between two words in a window of text?
I know it can be calculated with nested loops, but I want to know if there exists a simple way or a simple function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix
So could anyone tell me whether there is any function in Python that can deal with this problem concisely? I think this task is quite common in NLP.
It is not that complicated, I think. Why not write a function yourself? First get the matrix X of word indices, one row per sentence (the vocabulary can be built as in this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage). Then, for each sentence, calculate the co-occurrences and add them to a running total.
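    import numpy as np

    # Assumes X is an (n, length) array of word indices, and that vocab_size,
    # window and length are already defined.
    m = np.zeros((vocab_size, vocab_size))  # one row and column per distinct word
    def cal_occ(sentence, m):
        for i, word in enumerate(sentence):
            # context of position i: up to `window` positions on each side
            for j in range(max(i - window, 0), min(i + window + 1, length)):
                if j != i:  # do not count a word as its own context
                    m[word, sentence[j]] += 1

    for sentence in X:
        cal_occ(sentence, m)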
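The same counting can also be done without looping over every position in Python, by shifting the whole matrix by each offset and accumulating the pairs with np.add.at. This is only a sketch under some assumptions: X is an (n, length) integer array of word indices, window is the number of positions on each side of a word, and a word is not counted as its own context. The names cooccurrence_matrix and vocab_size are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(X, vocab_size, window):
    """Windowed co-occurrence counts for an (n, length) array of word indices.

    Assumption: `window` is the number of positions on each side of a word.
    """
    X = np.asarray(X)
    D = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    length = X.shape[1]
    for offset in range(1, min(window, length - 1) + 1):
        left = X[:, :-offset].ravel()   # word at position i
        right = X[:, offset:].ravel()   # word at position i + offset
        # np.add.at accumulates correctly even when index pairs repeat
        np.add.at(D, (left, right), 1)
        np.add.at(D, (right, left), 1)
    return D

# Toy input: two sentences of length 4 over a vocabulary of 3 words
X = np.array([[0, 1, 2, 0],
              [1, 1, 2, 2]])
D = cooccurrence_matrix(X, vocab_size=3, window=1)
print(D)
```

The Python loop now runs only over the window offsets rather than over every word, which helps when n and length are large. A plain D[left, right] += 1 would silently drop repeated pairs, which is why np.add.at is used here.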