Tags: python, pandas, nlp, nltk, lemmatization

How to avoid re-lemmatizing duplicate sentences in a pandas DataFrame for a speedup


Given:

A simple and small pandas dataframe as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user_ip":       ["u7", "u3", "u1", "u9", "u4","u8", "u1", "u2", "u5"],
        "raw_sentence":  ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!", "I go to school everyday!", "This is 1 sample text!", "She likes chips!", "This is the thrid sentence.", "I go to school everyday!"],
    }
)

    user_ip    raw_sentence
0   u7         First sentence!
1   u3         NaN
2   u1         I go to school everyday! 
3   u9         She likes chips!
4   u4         I go to school everyday!     <<< duplicate >>>
5   u8         This is 1 sample text!
6   u1         She likes chips!             <<< duplicate >>>
7   u2         This is the thrid sentence.
8   u5         I go to school everyday!     <<< duplicate >>>

Goal:

I wonder whether I can avoid calling map, or use some other strategy, for rows whose raw_sentence is an exact duplicate of a sentence that has already been processed. My aim is to speed up my implementation for larger pandas DataFrames (~100K rows).
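To gauge how much work is redundant before optimizing, one option is to compare the number of non-null rows with the number of distinct sentences. A minimal sketch on the sample data above (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

sentences = pd.Series(
    ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!",
     "I go to school everyday!", "This is 1 sample text!", "She likes chips!",
     "This is the thrid sentence.", "I go to school everyday!"]
)

non_null = sentences.dropna()
n_total = len(non_null)           # rows that need a lemma at all
n_unique = non_null.nunique()     # distinct sentences that actually need processing
n_redundant = n_total - n_unique  # calls that skipping duplicates would save

print(n_total, n_unique, n_redundant)
```

On the sample data, 3 of the 8 non-null rows are repeats; on a real dataset the ratio shows how much a deduplication step can be expected to help.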

[Inefficient] Solution:

Right now, I use .map() with a lambda that goes through each row and calls get_lm() to retrieve the lemmas of the raw input sentence, as follows:

import nltk
nltk.download('all', quiet=True, raise_on_error=True,)
STOPWORDS = nltk.corpus.stopwords.words('english')
wnl = nltk.stem.WordNetLemmatizer()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

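def get_lm(input_sent: str = "my text!"):
    # Lowercase and tokenize, then drop stopwords, single characters, and purely numeric tokens.
    tks = [w for w in tokenizer.tokenize(input_sent.lower()) if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    # Use the first letter of each POS tag as the WordNet POS when it is a valid one; otherwise fall back to the default (noun).
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ['a', 's', 'r', 'n', 'v'] else wnl.lemmatize(w) for w, t in nltk.pos_tag(tks)]
    return lms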

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')

    user_ip     raw_sentence                    lemma
0   u7          First sentence!                 [first, sentence]         <<< 1st occurrence => lemmatization OK! >>>
1   u3          NaN                             NaN                       <<< NaN skipped via na_action='ignore' >>>
2   u1          I go to school everyday!        [go, school, everyday]    <<< 1st occurrence => lemmatization OK! >>>
3   u9          She likes chips!                [like, chip]              <<< 1st occurrence => lemmatization OK! >>>
4   u4          I go to school everyday!        [go, school, everyday]    <<< already lemmatized, no need to do it again >>>
5   u8          This is 1 sample text!          [sample, text]            <<< 1st occurrence => lemmatization OK! >>>
6   u1          She likes chips!                [like, chip]              <<< already lemmatized, no need to do it again >>>
7   u2          This is the thrid sentence.     [thrid, sentence]         <<< 1st occurrence => lemmatization OK! >>>
8   u5          I go to school everyday!        [go, school, everyday]    <<< already lemmatized, no need to do it again >>>

Is there a more efficient way to do this?

Cheers,


Solution

  • Don't reinvent the wheel; use functools.cache (Python 3.9+), which memoizes get_lm so that each distinct sentence is lemmatized only once:

    from functools import cache
    
    @cache
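    def get_lm(input_sent: str = "my text!"):
        # Lowercase and tokenize, then drop stopwords, single characters, and purely numeric tokens.
        tks = [w for w in tokenizer.tokenize(input_sent.lower()) if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
        # Use the first letter of each POS tag as the WordNet POS when it is a valid one; otherwise fall back to the default (noun).
        lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ['a', 's', 'r', 'n', 'v'] else wnl.lemmatize(w) for w, t in nltk.pos_tag(tks)]
        return lms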
    
    df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
    

    Output:

      user_ip                 raw_sentence                   lemma
    0      u7              First sentence!       [first, sentence]
    1      u3                          NaN                     NaN
    2      u1     I go to school everyday!  [go, school, everyday]
    3      u9             She likes chips!            [like, chip]
    4      u4     I go to school everyday!  [go, school, everyday]
    5      u8       This is 1 sample text!          [sample, text]
    6      u1             She likes chips!            [like, chip]
    7      u2  This is the thrid sentence.       [thrid, sentence]
    8      u5     I go to school everyday!  [go, school, everyday]
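
A different technique, which does not rely on caching, is to process only the distinct sentences and then map the results back onto the column with a dict. The sketch below is an illustration under assumptions, not part of the original answer: fake_get_lm is a hypothetical stand-in for get_lm() so that the example runs without NLTK data, and the calls counter exists only to show how often the function body runs.

```python
import numpy as np
import pandas as pd

calls = 0  # counts how many times the expensive function actually runs


def fake_get_lm(sentence: str) -> list:
    """Hypothetical stand-in for get_lm(); it only lowercases and splits."""
    global calls
    calls += 1
    return sentence.lower().rstrip("!.").split()


df = pd.DataFrame(
    {
        "raw_sentence": [
            "First sentence!",
            np.nan,
            "I go to school everyday!",
            "She likes chips!",
            "I go to school everyday!",
            "She likes chips!",
        ]
    }
)

# Process each distinct, non-null sentence exactly once.
unique_sentences = df["raw_sentence"].dropna().unique()
lemma_by_sentence = {s: fake_get_lm(s) for s in unique_sentences}

# Map the results back; NaN has no entry in the dict, so it stays NaN.
df["lemma"] = df["raw_sentence"].map(lemma_by_sentence)

print(calls)  # one call per distinct sentence, not per row
print(df)
```

With the real get_lm() in place of the stand-in, the rest should carry over unchanged. Duplicate rows end up referencing the same list object, so it is safest to treat the lemma lists as read-only.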