Tags: python, python-3.x, optimization, nltk, tokenize

NLTK Tokenizing Optimization


I have an NLTK parsing function that I am using to parse a ~2GB text file of a TREC dataset. The goal is to tokenize the entire collection, perform some calculations (such as computing TF-IDF weights), and then run some queries against the collection, using cosine similarity to return the best results.

As it stands, my program works, but a run typically takes between 44 and 61 minutes, and sometimes well over an hour. The timing of one run breaks down as follows:

TOTAL TIME TO COMPLETE: 4487.930628299713
TIME TO GRAB SORTED COSINE SIMS: 35.24157094955444
TIME TO CREATE TFIDF BY DOC: 57.06743311882019
TIME TO CREATE IDF LOOKUP: 0.5097501277923584
TIME TO CREATE INVERTED INDEX: 2.5217013359069824
TIME TO TOKENIZE: 4392.5711488723755

So obviously, the tokenization is accounting for ~98% of the time. I am looking for a way to speed that up.

The tokenization code is below:

import collections
import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer


def remove_nums(arr):
    pattern = '[0-9]'
    arr = [re.sub(pattern, '', i) for i in arr]
    return arr


def get_words(para):
    stop_words = list(stopwords.words('english'))
    words = RegexpTokenizer(r'\w+')
    lower = [word.lower() for word in words.tokenize(para)]
    nopunctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower]
    no_integers = remove_nums(nopunctuation)
    dirty_tokens = [data for data in no_integers if data not in stop_words]
    tokens = [data for data in dirty_tokens if data.strip()]
    return tokens

def driver(file):
    myfile = get_input(file)  # get_input() is defined elsewhere and returns the file contents as a string
    p = r'<P ID=\d+>.*?</P>'
    paras = RegexpTokenizer(p)
    document_frequency = collections.Counter()
    collection_frequency = collections.Counter()
    all_lists = []
    currWordCount = 0
    currList = []
    currDocList = []
    all_doc_lists = []
    num_paragraphs = len(paras.tokenize(myfile))


    print()
    print(" NOW BEGINNING TOKENIZATION ")
    print()
    for para in paras.tokenize(myfile):             
        group_para_id = re.match("<P ID=(\d+)>", para)
        para_id = group_para_id.group(1)       
        tokens = get_words(para)
        tokens = list(set(tokens))     
        collection_frequency.update(tokens)      
        document_frequency.update(set(tokens))       
        para = para.translate(str.maketrans('', '', string.punctuation))     
        currPara = para.lower().split()      
        for token in tokens:          
            currWordCount = currPara.count(token)          
            currList = [token, tuple([para_id, currWordCount])]          
            all_lists.append(currList)

            currDocList = [para_id, tuple([token, currWordCount])]
            all_doc_lists.append(currDocList)

    d = {}
    termfreq_by_doc = {}    
    for key, new_value in all_lists:       
        values = d.setdefault(key, [])       
        values.append(new_value)

    for key, new_value in all_doc_lists:
        values = termfreq_by_doc.setdefault(key, [])
        values.append(new_value)

I am pretty new to optimization, and am looking for some feedback. I did see this post which condemns a lot of my list comprehensions as "evil", but I can't think of a way around what I am doing.

The code is not well commented, so if parts of it are hard to follow, that is understandable. I see other questions on this forum about speeding up NLTK tokenization that received little feedback, so I am hoping for a constructive thread about good practices for optimizing tokenization.


Solution

  • By: https://codereview.stackexchange.com/users/25834/reinderien

    On: https://codereview.stackexchange.com/questions/230393/tokenizing-sgml-text-for-nltk-analysis

    Regex compilation

    If performance is a concern, this:

    arr = [re.sub(pattern, '', i) for i in arr]
    

    is a problem. You're re-compiling your regex on every function call - and every loop iteration! Instead, move the regex to a re.compile()d symbol outside of the function.
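    For the remove_nums() pattern, for example, that change might look roughly like this (NUM_PATTERN is just an illustrative name, not something from the original code):

    NUM_PATTERN = re.compile(r'[0-9]')  # compiled once, at module level

    def remove_nums(arr):
        # every call reuses the precompiled pattern
        return [NUM_PATTERN.sub('', i) for i in arr]
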

    The same applies to re.match("<P ID=(\d+)>", para). In other words, you should be issuing something like

    group_para_re = re.compile(r"<P ID=(\d+)>")
    

    outside of the loop, and then

    group_para_id = group_para_re.match(para)
    

    inside the loop.

    Premature generator materialization

    That same comprehension in remove_nums has another problem - you're forcing the return value to be a list. Looking at your no_integers usage, you just iterate over it again, so there's no value in holding onto the entire result in memory. Instead, keep it as a generator - replace your brackets with parentheses.

    The same thing applies to nopunctuation.
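
    As a rough sketch (keeping the original variable names, and assuming the compiled NUM_PATTERN shown earlier), remove_nums() and the middle of get_words() could become:

    def remove_nums(arr):
        # generator expression: nothing is materialized until the caller iterates
        return (NUM_PATTERN.sub('', i) for i in arr)

    lower = (word.lower() for word in words.tokenize(para))
    nopunctuation = (w.translate(str.maketrans('', '', string.punctuation)) for w in lower)
    no_integers = remove_nums(nopunctuation)
    dirty_tokens = (w for w in no_integers if w not in stop_words)
    tokens = [w for w in dirty_tokens if w.strip()]  # only the final result is a list, since the driver iterates over it more than once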

    Set membership

    stop_words should not be a list - it should be a set. Read about its performance here. Lookup is average O(1), instead of O(n) for a list.
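
    In this code, that amounts to a one-line change in get_words():

    stop_words = set(stopwords.words('english'))  # average O(1) membership tests in the comprehension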

    Variable names

    nopunctuation should be no_punctuation.