
Replace all collocations in a text file with a dictionary of collocations in python


I'm trying to replace substrings in a text file [corpus.txt] with some other substrings [collocation|ngram] using Python. I have the list of possible substrings in a file sub.txt containing the following:

dogs chase
birds eat
chase birds
chase cat
chase birds .

and a corpus.txt containing some texts as below:

dogs chase cats around
dogs bark
cats meow
dogs chase birds
cats chase birds , birds eat grains
dogs chase the cats
the birds chirp

with the desired output

<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds 
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp

And here is my Python code with multiprocessing (used because of the size of corpus and sub):

import sys
import string
import time
from multiprocessing import Pool
import re
import itertools
flatten = itertools.chain.from_iterable

#corpus_dir =  sys.argv[1]
#ngram_dir = sys.argv[2]

#f = open(corpus_dir) # Open file on read mode
#corpus = f.read().split("\n") # Create a list containing all lines
#f.close() # Close file

#f2 = open(ngram_dir) # Open file on read mode
#sub = f2.read().split("\n") # Create a list containing all lines
#f2.close() # Close file

sub = ['dogs chase', 'birds eat', 'chase birds', 'chase cat', 'chase birds .']
corpus = [' dogs chase cats around ', ' dogs bark ', ' cats meow ', ' dogs chase birds ', ' cats chase birds , birds eat grains ', ' dogs chase the cats ', ' the birds chirp ']
print("The corpus has", len(corpus), "lines")


sbsx = { " "+ng+" " : " <bop> "+ng+" <eop> " for ng  in sub }
def multiple_replace(string, rep_dict):
    pattern = re.compile("|".join([re.escape(k) for k in sorted(rep_dict, key=len, reverse=True)]), flags=re.DOTALL)
    print("replaced = ")
    return pattern.sub(lambda x: rep_dict[x.group(0)], string)

def f(a_list):
    out = [multiple_replace(sent, sbsx) for sent in a_list]
    return out

def f_amp(a_list):
    #chunks = [a_list[i::5] for i in range(5)]
    chunks = [a_list[x:x+5] for x in range(0, len(a_list), 5)]
    print(len(chunks))

    pool = Pool(processes=10)

    result = pool.map_async(f, chunks)

    while not result.ready():
        print("Running...")
        time.sleep(0.5)

    pool.close()
    pool.join()
    return list(flatten(result.get()))


final_anot = f_amp(corpus)
print(final_anot)

I added the already-initialized corpus and sub variables (in the snippet above) to show how the code works. In the actual setting, corpus.txt and sub.txt contain millions of lines (200M+ and 4M+ respectively). I need code that can do the task efficiently; I have tried multiprocessing with Pool, but it would take weeks to complete. Are there other, faster ways to go about this task?


Solution

  • You are recompiling your pattern for every sentence, which takes a fair amount of time. Instead, you can compile your pattern globally, once:

    sbsx = { " "+ng+" " : " <bop> "+ng+" <eop> " for ng in sub }
    pattern = re.compile("|".join([re.escape(k) for k in sorted(sbsx, key=len, reverse=True)]), flags=re.DOTALL)

    def multiple_replace(string):
        print("replaced = ")
        return pattern.sub(lambda x: sbsx[x.group(0)], string)
    

    I tested this by running your sample sentences 1 million times, and the run went from 52 seconds down to only 13 seconds.
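    The difference is easy to check with a small `timeit` sketch (an illustrative benchmark, not the original test script; iteration counts and timings will vary by machine):

    ```python
    import re
    import timeit

    sub = ['dogs chase', 'birds eat', 'chase birds', 'chase cat', 'chase birds .']
    sbsx = {" " + ng + " ": " <bop> " + ng + " <eop> " for ng in sub}

    # Compiled once, globally; longest keys first so longer collocations win
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(sbsx, key=len, reverse=True)),
        flags=re.DOTALL,
    )

    def replace_precompiled(text):
        return pattern.sub(lambda m: sbsx[m.group(0)], text)

    def replace_recompiled(text):
        # Rebuilds the pattern on every call, as in the original code
        p = re.compile(
            "|".join(re.escape(k) for k in sorted(sbsx, key=len, reverse=True)),
            flags=re.DOTALL,
        )
        return p.sub(lambda m: sbsx[m.group(0)], text)

    sent = " cats chase birds , birds eat grains "
    print("precompiled:", timeit.timeit(lambda: replace_precompiled(sent), number=10_000))
    print("recompiled: ", timeit.timeit(lambda: replace_recompiled(sent), number=10_000))
    ```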

    I hope I did not miss anything, and that this helps speed up your code.
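    For a 200M-line corpus, it may also help to stream the file line by line instead of reading it all into memory, and to let `Pool.imap` hand out work in large batches rather than 5-line chunks, so workers are not starved by inter-process overhead. A minimal sketch along those lines (the file names, process count, and `chunksize` are assumptions to adapt to your machine):

    ```python
    import re
    from multiprocessing import Pool

    sub = ['dogs chase', 'birds eat', 'chase birds', 'chase cat', 'chase birds .']
    sbsx = {" " + ng + " ": " <bop> " + ng + " <eop> " for ng in sub}

    # Compiled once at module level, so every worker process has it
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(sbsx, key=len, reverse=True)),
        flags=re.DOTALL,
    )

    def annotate(line):
        return pattern.sub(lambda m: sbsx[m.group(0)], line)

    def main(src_path="corpus.txt", dst_path="out.txt"):
        with Pool(processes=10) as pool, \
                open(src_path) as src, open(dst_path, "w") as dst:
            # imap streams lines through the pool in order; a large
            # chunksize amortizes the cost of sending work to workers
            for out_line in pool.imap(annotate, src, chunksize=10_000):
                dst.write(out_line)

    if __name__ == "__main__":
        # Tiny demo input so the sketch runs as-is
        with open("corpus.txt", "w") as f:
            f.write(" dogs chase cats around \n dogs bark \n")
        main()
        print(open("out.txt").read())
    ```

    With `sub.txt` at 4M+ entries, the single alternation pattern itself may become the bottleneck; if so, approaches built for large dictionaries (e.g. an Aho-Corasick automaton) are worth benchmarking against `re`.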