I am trying to create a multiprocessing version of text categorization code I found here (amongst other cool things). I've appended the full code below.
I've tried a couple of things. First a lambda function, but Pool complained that it was not serializable (lambdas can't be pickled, as it turns out), so I attempted a stripped-down version of the original code:
from multiprocessing import Pool
from nltk.corpus import movie_reviews
# featx as defined in the linked article

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

p = Pool(2)
negfeats = []
posfeats = []
for f in negids:
    words = movie_reviews.words(fileids=[f])
    negfeats = p.map(featx, words)  # not same form as below - using for debugging
print len(negfeats)
Unfortunately, even this doesn't work; I get the following trace:
File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
raise self._value
ZeroDivisionError: float division
Any idea what I might be doing wrong? Should I be using pool.apply_async instead? (In itself that doesn't seem to solve the problem either, but perhaps I am barking up the wrong tree.)
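For reference, the apply_async variant I tried looked roughly like this (simplified, with the same featx as in the stripped-down version):

from multiprocessing import Pool
from nltk.corpus import movie_reviews

p = Pool(2)
negids = movie_reviews.fileids('neg')
for f in negids:
    words = movie_reviews.words(fileids=[f])
    results = [p.apply_async(featx, (w,)) for w in words]
    negfeats = [r.get() for r in results]  # .get() re-raises the same error
print len(negfeats)

The full code I'm adapting is below: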
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
Regarding your stripped-down version: are you using a different featx function than the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/?
The exception most probably happens inside featx, and multiprocessing just re-raises it, though it does not include the original traceback, which makes it a bit unhelpful.
Try running it without pool.map() first (i.e. negfeats = [featx(w) for w in words]) or include something in featx that you can debug.
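For the latter, one option (just a sketch; debuggable_featx is a name I made up, and it assumes featx is defined as in your script) is to catch exceptions in the worker and re-raise them with the child-side traceback attached:

import traceback

def debuggable_featx(words):
    try:
        return featx(words)
    except Exception:
        # Pool.map() only sends the exception value back to the
        # parent, so attach the worker-side traceback ourselves.
        raise Exception(traceback.format_exc())

Then p.map(debuggable_featx, words) will at least show where inside featx things blow up.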
If that still doesn't help, post the whole script you are working on in your original question (simplified, if possible) so others can run it and provide more directed answers. Note that the following code fragment actually works (adapting your stripped-down version):
from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats = []
    posfeats = []
    for f in negids:
        words = movie_reviews.words(fileids=[f])
        negfeats = p.map(featx, words)
    print len(negfeats)
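Be aware, though, that in this form p.map() applies featx to each individual word (a string), so every call builds a dict of that word's characters. If what you actually want is one feature dict per review file, as in evaluate_classifier, a sketch along these lines should work (file_feats is a hypothetical helper; only the fileid strings get pickled and sent to the workers):

from multiprocessing import Pool
from nltk.corpus import movie_reviews

def featx(words):
    return dict([(word, True) for word in words])

def file_feats(fileid):
    # Hypothetical helper: load the words inside the worker process,
    # so only the fileid string crosses the process boundary.
    return featx(movie_reviews.words(fileids=[fileid]))

if __name__ == "__main__":
    p = Pool(2)
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
    negfeats = [(feats, 'neg') for feats in p.map(file_feats, negids)]
    posfeats = [(feats, 'pos') for feats in p.map(file_feats, posids)]
    print len(negfeats), len(posfeats)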