python · pandas · optimization · nlp

How to optimize this nested loop code dealing with pandas dataframes


I am new to optimization and need help improving the run time of this code. It accomplishes my task, but it takes forever. Any suggestions on improving it so it runs faster?

Here is the code:

def probabilistic_word_weighting(df, lookup):

    # instantiate a new placeholder for the class weights of this text sequence
    class_probabilities = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for index, row in lookup.iterrows():
        if row.word in df.words.split():
            class_proba_ = row.class_proba.strip('][').split(', ')
            class_proba_ = [float(i) for i in class_proba_]
            class_probabilities = [a + b for a, b in zip(class_probabilities, class_proba_)]

    return class_probabilities

The two input df's look like this:

df

index                                    words
1                               i  havent  been  back 
2                                            but  its 
3                   they  used  to  get  more  closer 
4                                             no  way 
5       when  we  have  some  type  of  a  thing  for
6                and  she  had  gone  to  the  doctor 
7                                                suze 
8        the  only  time  the  parents  can  call  is
9               i  didnt  want  to  go  on  a  cruise 
10                            people  come  aint  got 

lookup

index    word                               class_proba
6231    been    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
8965    havent  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
3270    derive  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7817    a       [0.0, 0.0, 7.451379, 6.552, 0.0, 0.0, 0.0, 0.0]
3452    hello   [0.0, 0.0, 0.0, 0.0, 0.000155327, 0.0, 0.0, 0.0]
5112    they    [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]
1012    time    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7468    some    [0.000193199, 0.0, 0.0, 0.000212947, 0.0, 0.0, 0.0, 0.0]
6428    people  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
5537    scuba   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
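
For reference, the two frames can be built roughly like this from the values above (a sketch for reproducing the problem, only the first few rows of each, not my actual loading code):

import pandas as pd

df = pd.DataFrame({'words': [
    'i havent been back',
    'but its',
    'they used to get more closer',
    'no way',
    'when we have some type of a thing for',
]})

# note: class_proba is stored as a string, hence the strip('][') and split(', ') in the code
lookup = pd.DataFrame({
    'word': ['been', 'havent', 'a', 'they', 'some'],
    'class_proba': [
        '[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]',
        '[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]',
        '[0.0, 0.0, 7.451379, 6.552, 0.0, 0.0, 0.0, 0.0]',
        '[0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]',
        '[0.000193199, 0.0, 0.0, 0.000212947, 0.0, 0.0, 0.0, 0.0]',
    ],
})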

What it's doing is essentially iterating through each row of lookup, which contains a word and its relative class weights. If that word is found in a text sequence in df.words, then the class_proba for the lookup word is added to the class_probabilities variable assigned to that sequence. So it ends up looping through every row of df for every row of lookup.
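
The function gets applied once per text sequence, roughly like this (via df.apply along the rows):

# apply the weighting to each row of df, i.e. to each text sequence
df['class_proba'] = df.apply(
    lambda row: probabilistic_word_weighting(row, lookup), axis=1
)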

How can this be done faster?


Solution

  • IIUC, you are applying your function to each row of df with df.apply, but you can do it like this instead. The idea is not to redo the work on the rows of lookup each time you find a matching word, but to do it once and reshape df so that the whole operation can be performed with vectorized manipulations.

    1: Reshape the words column of df with str.split, stack and to_frame to get one row per word:

    s_df = df['words'].str.split(expand=True).stack().to_frame(name='split_word')
    print (s_df.head(8))
        split_word
    0 0          i
      1     havent
      2       been
      3       back
    1 0        but
      1        its
    2 0       they
      1       used
    

    2: Reshape lookup: set_index on the word column, then str.strip, str.split and astype to get a dataframe with word as the index and each value of class_proba in its own float column:

    split_lookup = lookup.set_index('word')['class_proba'].str.strip('][')\
                         .str.split(', ', expand=True).astype(float)
    print (split_lookup.head())
              0    1         2      3         4    5    6         7
    word                                                           
    been    0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
    havent  0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
    derive  0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
    a       0.0  0.0  7.451379  6.552  0.000000  0.0  0.0  0.000000
    hello   0.0  0.0  0.000000  0.000  0.000155  0.0  0.0  0.000000
    

    3: Merge both, drop the now-unnecessary split_word column, then groupby on level=0 (the original index of df) and sum:

    df_proba = s_df.merge(split_lookup, how='left',
                          left_on='split_word', right_index=True)\
                   .drop('split_word', axis=1)\
                   .groupby(level=0).sum()
    print (df_proba.head())
              0    1         2         3    4    5    6         7
    0  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0  10.55799
    1  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0   0.00000
    2  0.000000  0.0  0.000323  0.000000  0.0  0.0  0.0   0.00000
    3  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0   0.00000
    4  0.000193  0.0  7.451379  6.552213  0.0  0.0  0.0   0.00000
    

    4: Finally, convert the result back to lists and reassign them to the original df with to_numpy and tolist:

    df['class_proba'] = df_proba.to_numpy().tolist()
    print (df.head())
                                               words  \
    0                          i  havent  been  back   
    1                                       but  its   
    2              they  used  to  get  more  closer   
    3                                        no  way   
    4  when  we  have  some  type  of  a  thing  for   
    
                                             class_proba  
    0   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.55798974]  
    1           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
    2  [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, ...  
    3           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
    4  [0.000193199, 0.0, 7.451379, 6.552212946999999...
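
    Putting the four steps together, the whole thing can be wrapped as a whole-dataframe replacement for the original per-row function, for example like this (the name vectorized_word_weighting is just a suggestion; the body is exactly the code from steps 1 to 4):

    def vectorized_word_weighting(df, lookup):
        # 1: one row per word of each sequence, keeping the original df index as level 0
        s_df = df['words'].str.split(expand=True).stack().to_frame(name='split_word')

        # 2: lookup indexed by word, class_proba parsed into float columns
        split_lookup = lookup.set_index('word')['class_proba'].str.strip('][')\
                             .str.split(', ', expand=True).astype(float)

        # 3: align the weights to each word, then sum them per original sequence
        df_proba = s_df.merge(split_lookup, how='left',
                              left_on='split_word', right_index=True)\
                       .drop('split_word', axis=1)\
                       .groupby(level=0).sum()

        # 4: one list of class weights per text sequence
        return df_proba.to_numpy().tolist()

    df['class_proba'] = vectorized_word_weighting(df, lookup)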