Tags: performance, python-2.7, nested-loops, feature-selection

Condense a nested for loop to improve processing time for text analysis in Python


I am working on an untrained classifier model in Python 2.7, and I have a loop that looks like this:

    features = [0 for i in xrange(len(dictionary))]
    for bgrm in new_scored:
        for i in xrange(len(dictionary)):
            if bgrm[0] == dictionary[i]:
                features[i] = int(bgrm[1])
                break

I have a "dictionary" of bigrams collected from a data set of customer reviews, and I would like to construct a feature array for each review that corresponds to this dictionary. Each array would hold the frequencies of the dictionary's bigrams that occur in the review (I hope that makes sense). new_scored is a list of tuples, each pairing a bigram found in a particular review with its relative frequency of occurrence in that review. The final feature arrays will be the same length as the dictionary, with few non-zero entries.
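To make the shapes concrete, here is a toy sketch of the data and the loop. The bigrams and counts are invented for illustration, and range stands in for xrange so the snippet runs on both Python 2 and Python 3.

```python
# Invented toy data: a three-entry "dictionary" of bigrams and the scored
# bigrams of a single review, as (bigram, frequency) tuples.
dictionary = [("good", "food"), ("slow", "service"), ("great", "value")]
new_scored = [(("great", "value"), 3), (("slow", "service"), 1)]

# Same nested loop as above: for every scored bigram, scan the whole
# dictionary until a match is found.
features = [0 for i in range(len(dictionary))]
for bgrm in new_scored:
    for i in range(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
```

With this toy input, features ends up with one slot per dictionary entry, and only the slots for the two bigrams present in the review are non-zero.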

The loop above works fine, but my data set has 13,000 reviews, and running this code for every review is going to take forever (if my computer doesn't run out of RAM first). I have been staring at it for a while and cannot see how to condense it.

I am very new to Python, so I was hoping someone more experienced could help me condense it, or point me towards a library that contains the function I need.

Thank you in advance!


Solution

  • Consider making dictionary an actual dict object (or some fancier subclass of dict if that better suits your needs) rather than a plain sequence (it looks like a list or tuple at the moment). The dict could map each bigram (the key) to an integer identifying its feature position.

    If you refactor dictionary that way, then the loop can be rewritten as:

    features = [0 for key in dictionary]
    for bgrm in new_scored:
        try:
            # dictionary maps bigram -> feature position
            features[dictionary[bgrm[0]]] = int(bgrm[1])
        except KeyError:
            # the bigram is not in the dictionary; skip it (or handle it some other way)
            pass
    

    This should replace the O(n) traversal through dictionary for each bigram with a hash lookup, which takes O(1) time on average.

    Hope this helps.
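If the bigrams currently live in a list, the mapping can be built once with enumerate and then reused for every review. The following is a minimal sketch under stated assumptions: the helper names build_index and featurize are hypothetical, the toy data is invented, and the bigram list is assumed to contain no duplicates (otherwise the vector length would shrink to the number of unique bigrams).

```python
def build_index(bigram_list):
    """Map each bigram to its position in the feature vector (hypothetical helper)."""
    return dict((bigram, i) for i, bigram in enumerate(bigram_list))


def featurize(scored, index):
    """Return a dense feature list for one review (hypothetical helper).

    scored is a list of (bigram, frequency) tuples, like new_scored.
    Bigrams missing from the index are skipped.
    """
    features = [0] * len(index)
    for bigram, score in scored:
        position = index.get(bigram)  # None if the bigram is unknown
        if position is not None:
            features[position] = int(score)
    return features


# Invented toy data for illustration only.
bigram_list = [("good", "food"), ("slow", "service"), ("great", "value")]
index = build_index(bigram_list)  # built once, reused for all reviews

new_scored = [(("slow", "service"), 2), (("never", "again"), 1)]
features = featurize(new_scored, index)
```

Using dict.get instead of try/except is a style choice here; both avoid the inner scan. The index is built once in O(n), after which each review costs time proportional to the number of bigrams it contains rather than to the size of the dictionary.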