Tags: python, pandas, dataframe, apply, python-applymap

How can I use apply on two pandas list columns to find the position of one column's element in the other column's list?


I have a pandas DataFrame with columns "a" and "b". Each value in column "a" is a list, and each value in column "b" is a list holding a single element that is expected to appear in the corresponding "a" list. Using apply, I want to create a new column "c" that holds the position of the "b" element within the "a" list, that is, c = (index of b in a) + 1. Column "b" always contains either one element or none. Column "a" can have any length, but if it is empty, column "b" is empty as well. I only need the position of the first occurrence of the element in "a".

a                    b        c
['1', '2', '5']      ['2']    2
['2', '3', '4']      ['4']    3
['2', '3', '4']      []       0
[]                   []       0
...

I wrote a for loop which works fine but it is pretty slow:

for i in range(0, len(df)):
    if len(df['a'][i]) != 0:
        df['c'][i] = df['a'][i].index(*df['b'][i]) + 1
    else:
        df['c'][i] = 0

I would like to use apply to make it faster, but the following does not work. Any thoughts or suggestions would be greatly appreciated.

df['c']=df['a'].apply(df['a'].index(*df['b']))


Solution

  • First of all, here is a basic method using .apply().

    import pandas as pd
    import numpy as np
    
    list_a = [['1', '2', '5'], ['2', '3', '4'], ['2', '3', '4'], []]
    list_b = [['2'], ['4'], [], []]
    
    df_1 = pd.DataFrame(data=zip(list_a, list_b), columns=['a', 'b'])
    
    # Replace empty lists with NaN and unwrap the single element of 'b'.
    df_1['a'] = df_1['a'].map(lambda x: x if x else np.nan)
    df_1['b'] = df_1['b'].map(lambda x: x[0] if x else np.nan)
    # Equivalent: df_1['b'] = df_1['b'].map(lambda x: next(iter(x), np.nan))
    
    
    def calc_c(curr_row: pd.Series) -> int:
        # Return 0 when either value is missing.
        if not isinstance(curr_row['a'], list) or pd.isna(curr_row['b']):
            return 0
        # list.index() is 0-based; the question asks for a 1-based position.
        return curr_row['a'].index(curr_row['b']) + 1
    
    
    df_1['c'] = df_1[['a', 'b']].apply(func=calc_c, axis=1)
    

    df_1 result:

        a                  b    c
    --  ---------------  ---  ---
     0  ['1', '2', '5']    2    2
     1  ['2', '3', '4']    4    3
     2  ['2', '3', '4']  nan    0
     3  nan              nan    0
    

    I replaced the empty lists with NaN, which I find more idiomatic and practical in pandas.
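One way to see why NaN can be more practical than an empty list is sketched below, reusing the sample values of column "b". The names `raw_b`, `s_b` and `missing` are made up for this illustration. The point is that pandas's missing-data helpers such as `isna()` recognise NaN but treat an empty list as an ordinary value.

```python
import numpy as np
import pandas as pd

raw_b = pd.Series([['2'], ['4'], [], []])

# An empty list is an ordinary object to pandas, so nothing is flagged.
print(raw_b.isna().tolist())  # [False, False, False, False]

# After unwrapping the single element and mapping empties to NaN,
# the missing rows can be found with the standard helpers.
s_b = raw_b.map(lambda x: x[0] if x else np.nan)
missing = s_b.isna()
print(missing.tolist())    # [False, False, True, True]
print(int(missing.sum()))  # 2
```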

    The row-wise .apply() approach is not an ideal solution, since it still calls a Python function once per row; I will try to find something better. The more information we have about your program and the DataFrame, the better.
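As one possible alternative (a sketch only, not benchmarked here), a plain list comprehension over the two columns avoids building a Series for every row, which is what `apply(axis=1)` does. The helper name `first_position` is hypothetical. The sketch keeps the original list-valued columns without the NaN replacement, and it assumes that an element missing from "a" should also map to 0, which the question does not specify.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [['1', '2', '5'], ['2', '3', '4'], ['2', '3', '4'], []],
    'b': [['2'], ['4'], [], []],
})


def first_position(values: list, target: list) -> int:
    """Return the 1-based position of target[0] in values, or 0 if unavailable."""
    if not values or not target:
        return 0
    try:
        return values.index(target[0]) + 1
    except ValueError:
        # Assumption: an element that is absent from 'a' also maps to 0.
        return 0


# Iterate over both columns in lockstep instead of row-wise apply.
df['c'] = [first_position(a, b) for a, b in zip(df['a'], df['b'])]
print(df['c'].tolist())  # [2, 3, 0, 0]
```

Whether this is actually faster depends on the size and contents of the DataFrame, so it is worth timing both versions on real data.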