I have a large pandas dataframe with one column, "sentence", that contains text (each entry is around 100 words, and there are around 200,000 entries). I would like to build a dictionary from all the text in this column, where the keys are the words and the values are their absolute frequencies. My attempt was to write the following function:
def word_counter(text):
    text_as_list = text.split()
    d = {w: text_as_list.count(w) for w in set(text_as_list)}
    return dict(sorted(d.items()))
So I just have to call the function with the right argument:
word_counter(' '.join(df["sentence"]))
Execution time seems endless: I waited for 20 minutes and it was still running. I think my solution is not bad (I have not done a time complexity analysis, but it looks very "pythonic" to me, so I guess it is not a bad solution). I was wondering if there is a better way to do this, or whether this specific task is one of those cases where multiprocessing/multithreading is appropriate. Which one would be more appropriate in this situation? Why?
I would appreciate a solution based on ProcessPoolExecutor/ThreadPoolExecutor, since it is much easier for me to understand.
Thank you very much in advance.
Kind regards.
collections.Counter should be faster than your implementation: your dict comprehension calls text_as_list.count(w) once per unique word, and every call rescans the whole list, so the total work grows roughly quadratically with the amount of text, whereas Counter tallies everything in a single pass. itertools.chain.from_iterable also helps you avoid creating a single long string and one huge split list, in favor of small per-row iterable chunks. Put it all together and you get
>>> import pandas as pd
>>> import collections
>>> import itertools
>>>
>>> df = pd.DataFrame({"sentence":["three", "two three", "one two three"]})
>>> counts = collections.Counter(itertools.chain.from_iterable(line.split() for line in df["sentence"]))
>>> counts = dict(sorted(counts.items()))
>>> counts
{'one': 1, 'three': 3, 'two': 2}
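As for the executor question: counting words is CPU-bound, so ThreadPoolExecutor will not help here (CPython's GIL lets only one thread execute Python bytecode at a time), while ProcessPoolExecutor can, at the cost of pickling each chunk of rows over to the workers. Below is a minimal sketch of that route; the helper names (count_chunk, parallel_word_counts) and the chunking scheme are my own choices, not part of the Counter solution above:

import collections
import concurrent.futures

import pandas as pd

def count_chunk(sentences):
    # One worker's job: tally the words in its slice of rows.
    return collections.Counter(
        word for line in sentences for word in line.split()
    )

def parallel_word_counts(df, workers=4):
    sentences = df["sentence"].tolist()
    # Ceiling division: roughly equal rows per worker, at least 1 per chunk.
    step = max(1, (len(sentences) + workers - 1) // workers)
    chunks = [sentences[i:i + step] for i in range(0, len(sentences), step)]
    total = collections.Counter()
    # count_chunk must be a module-level function so it can be pickled.
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)  # Counter.update sums counts per key
    return dict(sorted(total.items()))

if __name__ == "__main__":
    df = pd.DataFrame({"sentence": ["three", "two three", "one two three"]})
    print(parallel_word_counts(df, workers=2))
    # {'one': 1, 'three': 3, 'two': 2}

That said, at around 20 million words the single-pass Counter is often fast enough that the pickling overhead eats much of the parallel gain, so it is worth timing both before reaching for processes.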