Tags: python, pandas, vectorization

How to optimize this iteration over a pandas dataframe


I have the following dataframe:

import pandas as pd
import re


d = {
    'I am a sentence of words.': 'words',
    'I am not a sentence of words.': 'words',
    'I have no sentence with words or punctuation': 'letter',
    'I am not a sentence with a letter or punctuation': 'letter'}

df = pd.Series(d).rename_axis('sentence').reset_index(name='mention')
                                           sentence mention
0                         I am a sentence of words.   words
1                     I am not a sentence of words.   words
2      I have no sentence with words or punctuation  letter
3  I am not a sentence with a letter or punctuation  letter

And I apply the following method to it to match various regex patterns:

def get_negated(row):
    negated = False

    # negation cues to look for before the mention
    terms = ['neg',
             'negative',
             'no',
             'free of',
             'not',
             'without',
             'denies',
             'ruled out']

    for term in terms:
        # match: ... <term> ... <mention> somewhere later in the sentence
        regex_str = r"(?:\s+\S+)*\b{0}(?:\s+\S+)*\s+{1}\b".format(term, row.mention)
        if re.search(regex_str, row.sentence):  # itertuples rows need attribute access
            negated = True
            break

    return int(negated)
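
To make the pattern concrete, here is a quick standalone check of the template above, using the example values term='not' and mention='words' from the data:

import re

# With term='not' and mention='words' the template expands to:
#   (?:\s+\S+)*\bnot(?:\s+\S+)*\s+words\b
# i.e. 'not' occurring somewhere before 'words' in the sentence.
regex_str = r"(?:\s+\S+)*\b{0}(?:\s+\S+)*\s+{1}\b".format('not', 'words')
print(bool(re.search(regex_str, 'I am not a sentence of words.')))  # True
print(bool(re.search(regex_str, 'I am a sentence of words.')))      # False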

I call this for each row via iteration:

negated_terms = []
for row in df.itertuples():
    negated_terms.append(get_negated(row))

and then add a new column to the dataframe via:

df['negated'] = negated_terms

with the following output:

df:

                                           sentence mention  negated
0                         I am a sentence of words.   words        0
1                     I am not a sentence of words.   words        1
2      I have no sentence with words or punctuation  letter        0
3  I am not a sentence with a letter or punctuation  letter        1

This works fine, but the dataframe has millions of rows, and a few other methods build additional columns from other regex patterns in the same way. As is, this takes several hours to run. I thought about switching to the apply method to speed things up, but given that there are multiple methods, I suspect that would actually be slower than my current implementation. Is there a more efficient (e.g., vectorized) way to do this? For the life of me, I haven't been able to find such a beast.
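
For reference, the apply variant I had in mind is the one-liner sketched below; note that apply with axis=1 still invokes the function once per row in Python, so it mostly trades the explicit loop for pandas overhead rather than vectorizing anything:

# Row-wise apply: each row is passed to get_negated as a Series,
# so the same attribute access (row.sentence, row.mention) works.
df['negated'] = df.apply(get_negated, axis=1)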


Solution

  • You can try this:

    terms = [
        'neg', 'negative', 'no', 'free of',
        'not', 'without', 'denies', 'ruled out'
    ]
    
    pat = "(?:%s).+{mention}" % "|".join(map(re.escape, terms))
    
    df["negated"] = [
        int(bool(re.search(pat.format(mention=m), s)))
        for s,m in df[["sentence", "mention"]].to_numpy()
    ]
    

    Output:

    print(df)
    
                                               sentence mention  negated
    0                         I am a sentence of words.   words        0
    1                     I am not a sentence of words.   words        1
    2      I have no sentence with words or punctuation  letter        0
    3  I am not a sentence with a letter or punctuation  letter        1
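
    A further micro-optimization, sketched here on the assumption that the number of distinct mention values is small relative to the row count: reuse pat from above but compile one pattern per unique mention up front, rather than re-formatting the pattern string for every row (re does cache compiled patterns internally, but the per-row formatting still costs time).

    # one compiled regex per unique mention, built once
    compiled = {
        m: re.compile(pat.format(mention=m))
        for m in df["mention"].unique()
    }

    # look up the precompiled pattern instead of rebuilding it per row
    df["negated"] = [
        int(bool(compiled[m].search(s)))
        for s, m in df[["sentence", "mention"]].to_numpy()
    ]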