Tags: python, pandas, nlp, multiprocessing, data-cleaning

How to use multiprocessing to pre-process a pandas DataFrame in a for loop in Python?


I have a dataset of 8500 rows of text. I want to apply a function pre_process to each of these rows. When I do it serially, it takes about 42 mins on my computer:

import pandas as pd
import time
import re

### constructing a sample dataframe of 10 rows to demonstrate
df = pd.DataFrame(columns=['text'])
df.text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

def pre_process(text):
    '''
    function to pre-process and clean text
    '''
    stop_words = ['in', 'of', 'at', 'a', 'the']

    # lowercase
    text = str(text).lower()

    # remove special characters except spaces, apostrophes and dots
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)

    # remove stopwords
    text = [word for word in text.split(' ') if word not in stop_words]

    return ' '.join(text)

t = time.time()
for i in range(len(df)):
    df.text[i] = pre_process(df.text[i])

print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

>>> Time taken for pre-processing the data = 41.95724259614944 mins

So, I want to use multiprocessing for this task. I took help from here and wrote the following code:

import pandas as pd
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

def func(text):
    return pre_process(text)

t = time.time()
results = pool.map(func, [df.text[i] for i in range(len(df))])
print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

pool.close()

But the code just keeps running and never stops.

How can I get it to work?


Solution

  • You can use pandas.Series.apply on the text column:

    df.text = df.text.apply(pre_process)
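
  • apply still runs pre_process in a single process, so it will not by itself spread the work across CPU cores. If you still want multiprocessing, one common reason a pool "just keeps on running" is creating it at module level without an if __name__ == '__main__': guard, which the multiprocessing docs require on platforms that spawn worker processes (and the pattern generally does not work from an interactive session). Below is a minimal sketch of that guarded pattern, assuming pre_process and df are defined at module level exactly as in the question and that this runs as a standalone script rather than a notebook:

    import multiprocessing as mp
    import time

    if __name__ == '__main__':
        t = time.time()

        # pool.map can call pre_process directly; the extra func wrapper adds nothing
        with mp.Pool(mp.cpu_count()) as pool:
            df.text = pool.map(pre_process, df.text)

        print('Time taken for pre-processing the data = {} mins'.format((time.time() - t) / 60))

    For only 8500 short strings, the cost of pickling rows to and from the worker processes can offset much of the gain, so it is worth timing both versions on the real dataset before settling on one.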