I have a large pandas dataframe with one column, "sentence", that contains text (each entry is around 100 words, and there are around 200,000 entries). I would like to build a dictionary from all the text in this column, where the keys are the words and the values are their absolute frequencies. My attempt was to write the following function:
def word_counter(text):
    text_as_list = text.split()
    d = {w: text_as_list.count(w) for w in set(text_as_list)}
    return dict(sorted(d.items()))
So I just have to call the function with the right argument:
word_counter(' '.join(df["sentence"]))
Execution time seems endless: I waited for 20 minutes and it was still running. I think my solution is not bad (I have not done a time complexity analysis, but it looks very "pythonic" to me, so I guess it is not a bad solution). I was wondering if there is a better way to do this, or whether this specific task is one of those cases where multiprocessing/multithreading is appropriate. Which one would be more appropriate in this situation? Why?
I would appreciate a solution based on ProcessPoolExecutor/ThreadPoolExecutor, since it is much easier for me to understand.
Thank you very much in advance.
Kind regards.
collections.Counter should be faster than your implementation: your dict comprehension calls text_as_list.count(w) once per unique word, and every call rescans the whole list, so the total work grows roughly quadratically with the amount of text, whereas Counter tallies everything in a single pass. itertools.chain.from_iterable also helps you avoid creating a single long string and one huge split list, in favor of small per-row iterable chunks. Put it all together and you get
>>> import pandas as pd
>>> import collections
>>> import itertools
>>>
>>> df = pd.DataFrame({"sentence":["three", "two three", "one two three"]})
>>> counts = collections.Counter(itertools.chain.from_iterable(line.split() for line in df["sentence"]))
>>> counts = dict(sorted(counts.items()))
>>> counts
{'one': 1, 'three': 3, 'two': 2}
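As for the executor question: counting words is CPU-bound, so ThreadPoolExecutor will not help here (CPython's GIL lets only one thread execute Python bytecode at a time), while ProcessPoolExecutor can, at the cost of pickling each chunk of rows over to the workers. Below is a minimal sketch of that route; the helper names (count_chunk, parallel_word_counts) and the chunking scheme are my own choices, not part of the Counter solution above:

import collections
import concurrent.futures

import pandas as pd

def count_chunk(sentences):
    # One worker's job: tally the words in its slice of rows.
    return collections.Counter(
        word for line in sentences for word in line.split()
    )

def parallel_word_counts(df, workers=4):
    sentences = df["sentence"].tolist()
    # Ceiling division: roughly equal rows per worker, at least 1 per chunk.
    step = max(1, (len(sentences) + workers - 1) // workers)
    chunks = [sentences[i:i + step] for i in range(0, len(sentences), step)]
    total = collections.Counter()
    # count_chunk must be a module-level function so it can be pickled.
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)  # Counter.update sums counts per key
    return dict(sorted(total.items()))

if __name__ == "__main__":
    df = pd.DataFrame({"sentence": ["three", "two three", "one two three"]})
    print(parallel_word_counts(df, workers=2))
    # {'one': 1, 'three': 3, 'two': 2}

That said, at around 20 million words the single-pass Counter is often fast enough that the pickling overhead eats much of the parallel gain, so it is worth timing both before reaching for processes.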