I am trying to make a co-occurrence count run fast in Python.
I have a list of context windows. Each context window is itself a list of words that are connected by a dependency label, produced after parsing the corpus with a transition parser. A context window has an average length of 3, so it is not simply the 10 words before and after the focus word. That is why I need a list of lists and can't just use a flat list of words.
I also have a dictionary where every distinct word of the corpus maps to a distinct index, but I don't think using it speeds things up much. Right now I am working with a small test corpus of 50,000 words and about 300,000 context windows. My code is fast enough for that, but it has to become a lot faster because it needs to run on a much larger corpus in the end.
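For illustration, the dictionary just maps each word to its index, something like this toy version (made-up entries; the real one covers the whole vocabulary):

    # toy example of the word -> index dictionary (made-up indices)
    token_to_index = {
        'Gerücht': 0,
        'drücken': 1,
        'Kurs': 2,
        'Aktie': 3,
    }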
import time
from collections import Counter


def current_milli_time():
    # helper: current wall-clock time in milliseconds
    return int(round(time.time() * 1000))


class CoOc:
    def __init__(self):
        # (focus_word_index, context_word_index) -> count
        self.co_occurrence = Counter()

    def add_count(self, token_index, context, token_to_index):
        try:
            # some words are not in the token dictionary if they don't appear frequently enough
            context_index = token_to_index[context]
        except KeyError:
            context_index = -1
        if context_index != -1:
            self.co_occurrence[(token_index, context_index)] += 1

    def count_co_occurrence(self, context_chains, token_index_dic):
        token_index_list = []
        for key in token_index_dic:
            token_index_list.append([key, token_index_dic[key]])
        time1 = current_milli_time()
        # only the first 100 focus words, to get a feel for the timing
        for token_key in token_index_list[0:100]:
            token = token_key[0]
            token_index = token_key[1]
            for context_chain in context_chains:
                if token in context_chain:
                    for context_word in context_chain:
                        if token != context_word:
                            self.add_count(token_index, context_word, token_index_dic)
        time2 = current_milli_time()
        print(time2 - time1)
So in the count function I turn the token_index dictionary into a list of [token, index] pairs and only use a small slice of it, to check how long it takes.
I also tried converting all the context windows into windows of indices before looping over the tokens (sketched below), so that I would not need the add_count function at all and could replace the call to add_count with its last line, but that did not speed things up.
Like this it takes 3 seconds to run, for only 100 focus words. It needs to be a lot faster to work on the corpus I actually want to use it on. Is that even possible in Python, or do I need to implement it in another language?
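Roughly, the index-conversion variant I mentioned above looked like this sketch (not my exact code; it assumes context_chains and token_index_dic as in the class above):

    from collections import Counter

    # Sketch of the variant I tried: convert every context window to indices once,
    # then increment the Counter directly instead of going through add_count.
    def count_with_index_windows(context_chains, token_index_dic):
        co_occurrence = Counter()
        # context windows as lists of vocabulary indices (infrequent words are dropped)
        index_chains = [[token_index_dic[w] for w in chain if w in token_index_dic]
                        for chain in context_chains]
        token_index_list = list(token_index_dic.items())
        for token, token_index in token_index_list[0:100]:
            for chain, index_chain in zip(context_chains, index_chains):
                if token in chain:
                    for context_index in index_chain:
                        if context_index != token_index:
                            co_occurrence[(token_index, context_index)] += 1
        return co_occurrence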
So the first 10 context windows look like this:
['Gerücht', 'drücken', 'Kurs']
['Konkursgerüchte', 'drücken', 'Kurs']
['Kurs', 'Aktie']
['Kurs', 'Amazon-Aktie']
['der', 'Amazon-Aktie']
['der', 'Aktie']
['Begleitet', 'Gerücht', 'Konkurs']
['Begleitet', 'Marktgerüchten', 'Konkurs']
['begleiten', 'Gerücht', 'Konkurs']
['begleiten', 'setzt fort', 'Aktie', 'Talfahrt']
Each context word appears in the token_index_dic with some index, unless it doesn't appear frequently enough in the corpus.
The first 10 elements of the token_index_list look like this:
['Browser-gestützte', 0]
['Akzeptanzstellen', 1]
['Nachschubplanung', 2]
['persönlichen', 3]
['Unionsfraktion', 4]
['Wired', 21122]
['Hauptfigur', 6]
['Strafgesetz', 7]
['Computer-Hersteller', 8]
['Rückschläge', 9]
and then when I print self.co_occurrence it looks like this:
# (focus_word_index, context_word_index): count
Counter({(1, 9479): 11, (1, 20316): 7, (2, 1722): 7, (2, 20217): 7, (2, 19842): 7, (2, 2934): 7, (3, 11959): 7, (3, 2404): 7, (3, 1105): 7, (4, 12047): 7, (4, 19262): 7, (0, 13585): 4, (1, 18525): 4, (1, 1538): 4, (1, 9230): 4, (1, 20606): 4, (1, 1486): 4, (2, 13166): 4, (2, 6948): 4, (2, 23028): 4, (2, 14051): 4, (3, 3854): 4, (3, 7908): 4, (3, 4902): 4, (3, 13222): 4, (4, 23737): 4, (4, 6785): 4, (4, 18718): 4, (5, 15424): 4, (5, 4394): 4, (5, 21534): 4, (5, 5829): 4, (5, 6513): 4, (6, 23331): 4, (6, 7234): 4, (6, 20606): 4, (6, 22951): 4, (6, 7318): 4, (6, 15332): 4, (9, 21183): 4, (9, 23028): 4, (9, 1572): 4, (1, 25031): 3, (1, 5829): 3, (1, 14458): 3, (3, 14387): 3, (3, 9574): 3, (3, 21061): 3, (4, 18143): 3, (4, 3306): 3, (4, 17798): 3, (4, 2250): 3, (5, 9982): 3, (5, 5999): 3, (6, 15727): 3, (0, 16008): 2, (0, 1304): 2, (0, 5210): 2, (0, 17798): 2, (1, 20000): 2, (1, 19326): 2, (1, 3820): 2, (1, 25173): 2, (1, 21843): 2, (2, 20662): 2, (3, 7521): 2, (3, 14479): 2, (3, 8109): 2, (3, 18311): 2, (4, 2556): 2, (5, 23940): 2, (5, 1823): 2, (5, 18733): 2, (6, 3131): 2, (6, 947): 2, (6, 18540): 2, (6, 4756): 2, (6, 2743): 2, (6, 7692): 2, (6, 20263): 2, (6, 8670): 2, (6, 2674): 2, (6, 20050): 2, (6, 13274): 2, (6, 17876): 2, (6, 7538): 2, (6, 11098): 2, (6, 4296): 2, (6, 2417): 2, (6, 21421): 2, (6, 19256): 2, (6, 16739): 2, (7, 10908): 2, (7, 4439): 2, (7, 9492): 2, (8, 7027): 2, (8, 599): 2, (8, 4439): 2, (9, 16030): 2, (9, 6856): 2, (9, 24320): 2, (9, 15978): 2, (1, 6454): 1, (1, 14482): 1, (1, 2643): 1, (1, 7338): 1, (2, 21061): 1, (2, 1486): 1, (4, 4296): 1, (4, 23940): 1, (4, 5775): 1, (5, 24133): 1, (5, 2743): 1, (5, 11472): 1, (5, 19336): 1, (5, 20606): 1, (5, 2740): 1, (5, 9479): 1, (5, 14200): 1, (6, 22175): 1, (6, 13104): 1, (6, 10435): 1, (6, 1891): 1, (6, 22353): 1, (6, 4852): 1, (6, 20943): 1, (6, 23965): 1, (6, 13494): 1, (7, 1300): 1, (7, 12497): 1, (7, 2788): 1, (8, 13879): 1, (8, 2404): 1})
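To read a single count back out, the key is the pair of indices, for example (hypothetical lookup, assuming both words are frequent enough to be in token_index_dic and co is the CoOc instance):

    # how often 'Kurs' and 'Aktie' were counted together (hypothetical example)
    pair = (token_index_dic['Kurs'], token_index_dic['Aktie'])
    print(co.co_occurrence[pair])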
Using multiprocessing and Cython. The first thing that has to be done is to change the list of context chains into a list of lists of indices, where every list of indices has the same length X. Each index is the index of a word in the vocabulary, so for each distinct word in a context chain there is a distinct index in the vocabulary. Context chains that are shorter than X have to be padded with additional indices; these padding indices should all be the same value and larger than the largest index in the vocabulary. After that is done, Cython and multiprocessing can be used. For this I just generate a random list of context chains and assume a vocabulary of size 32000. I have 8 cores on my machine, so I split the vocabulary into 8 slices, but more slices would work too. A sketch of the padding step is shown right below.
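A minimal sketch of that conversion and padding step on real data might look like this (hypothetical helper; the test code below skips it and generates random, already padded chains directly):

    # Sketch: convert word chains to fixed-length chains of vocabulary indices.
    # PAD must be the same for all chains and larger than every vocabulary index,
    # so the counting loop can recognise and skip it.
    def pad_context_chains(context_chains, token_index_dic, X, PAD):
        padded = []
        for chain in context_chains:
            # keep only words that made it into the vocabulary
            indices = [token_index_dic[w] for w in chain if w in token_index_dic][:X]
            # pad shorter chains up to the fixed length X
            padded.append(indices + [PAD] * (X - len(indices)))
        return padded

    # e.g. X = 10 and PAD = 50000 for a vocabulary of 32000 words, as in the test code below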
On the Python side:
This generates the random context chains and the vocabulary-index slices, and calls the function dosomething() in parallel.
import concurrent.futures
import random

import array_cy  # the compiled Cython module shown further down
import helper    # provides current_milli_time()


def runcoocinparallel():
    index_lists = []
    context_list = []
    # 300,000 random context chains: 3 real indices (vocabulary size 32000)
    # padded to length 10 with the out-of-vocabulary index 50000
    for ab in range(300000):
        list2 = []
        for ac in range(3):
            list2.append(random.randint(0, 32000 - 1))
        for ac in range(7):
            list2.append(50000)
        context_list.append(list2)
    # split the vocabulary of 32000 indices into 8 slices of 4000
    for j in range(8):
        indexlist = []
        for i in range(4000):
            indexlist.append(j * 4000 + i)
        index_lists.append(indexlist)
    time1 = helper.current_milli_time()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        a = executor.map(dosomething, index_lists, [context_list] * 8)
    print(helper.current_milli_time() - time1)
In this function the call to the Cython makecount function happens; the wrapper could be eliminated.
def dosomething(index_list, context_list):
    firstindex = index_list[0]
    index_list_len = len(index_list)
    # indizes_len = 8 * slice length = the full vocabulary size (32000 here)
    array_cy.makecount(firstindex, index_list_len, 8 * index_list_len, context_list)
On the Cython side:
# array_cy.pyx
cimport cython
from libc.stdlib cimport malloc, free

import helper  # provides current_milli_time()


# the two decorators below are supposed to speed up the indexing
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
cpdef makecount(int first_index, int indizes_slice_len, int indizes_len, context_list):
    # type definitions so Cython does not check the types in every loop iteration
    cdef int context_list_len = len(context_list)
    cdef int i
    cdef int j
    cdef int k
    cdef int context_index
    # allocating memory for the arrays:
    # a flat copy of the context chains (each chain has length 10) ...
    cdef int *context_list_cy = <int *> malloc(10 * context_list_len * sizeof(int))
    # ... and the array where the count is stored for each possible combination
    cdef int *combinations = <int *> malloc(indizes_slice_len * indizes_len * sizeof(int))
    # setting the counts to zero
    for i in range(indizes_slice_len * indizes_len):
        combinations[i] = 0
    # creating a C 1D array from the list of context chains
    for i in range(context_list_len):
        for j in range(10):
            context_list_cy[i * 10 + j] = context_list[i][j]
    # this is where all the time is spent
    time1 = helper.current_milli_time()
    for i in range(first_index, first_index + indizes_slice_len):
        for j in range(context_list_len):
            # checking if focus index i is inside this context chain
            if i == context_list_cy[j * 10] or i == context_list_cy[j * 10 + 1] or i == context_list_cy[j * 10 + 2] or \
                    i == context_list_cy[j * 10 + 3] or i == context_list_cy[j * 10 + 4] or \
                    i == context_list_cy[j * 10 + 5] or i == context_list_cy[j * 10 + 6] or \
                    i == context_list_cy[j * 10 + 7] or i == context_list_cy[j * 10 + 8] or i == context_list_cy[j * 10 + 9]:
                # if so, increase the count of every valid context index
                # (the focus index itself and the padding index 50000 are skipped)
                for k in range(10):
                    context_index = context_list_cy[j * 10 + k]
                    if not (i == context_index or context_index == 50000):
                        combinations[(i - first_index) * indizes_len + context_index] += 1
    print(helper.current_milli_time() - time1)
    # the counts are only used for timing here, so the memory is freed again
    free(context_list_cy)
    free(combinations)
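For completeness, a minimal setup.py sketch to build the Cython module (assuming the code above is saved as array_cy.pyx; compile with python setup.py build_ext --inplace):

    # setup.py - minimal build script for the array_cy extension (assumed filename)
    from setuptools import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("array_cy.pyx"))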
Note: for bigger context chain lists, RAM size becomes a problem: every one of the 8 worker processes gets its own copy of the full context list, on top of the combinations array, which already takes indizes_slice_len * indizes_len * sizeof(int) bytes, i.e. roughly 4000 * 32000 * 4 bytes ≈ 512 MB per process in this example.