Search code examples
pythonpandasdataframevectorization

Vectorization of column search by dynamic value in pandas


I am starting to learn Pandas. And which day I can not solve the fastest way to calculate. How to get for each row the value of a column by a unique name, composed of column 'a', 'b' values?

Below is an example of the initial data.

index a b a1_b1_name a1_b1_foo_bar a2_b1_name a2_b1_foo_bar a1_b2_name a1_b2_foo_bar a2_b2_name a2_b2_foo_bar a1_b3_name a1_b3_foo_bar a2_b3_name a2_b3_foo_bar
0 1 2 value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12
1 2 1 value13 value14 value15 value16 value17 value18 value19 value20 value21 value22 value23 value24
2 2 2 value25 value26 value27 value28 value29 value30 value31 value32 value33 value34 value35 value36
3 1 1 value37 value38 value39 value40 value41 value42 value43 value44 value45 value46 value47 value48
4 2 3 value49 value50 value51 value52 value53 value54 value55 value56 value57 value58 value59 value60

The number of columns with the values "a _b _name " is planned to be much larger, about 40. The number of rows will be in the tens of thousands.

I need to create a new column 'name' based on the data of the table as quickly as possible and preferably without loops, using the power of pandas vectorization.

Like this one:

index name foo_bar
0 value5 value6
1 value15 value16
2 value31 value32
3 value37 value38
4 value59 value60

I was only able to do this by looping through the columns. But it takes more time than I'd like:

for col in df.columns:
     df['name'] = np.where(col == 'a' + (df['a'].astype('Int16').astype(str)) + '_b' + (df['b'].astype('Int16').astype(str)) + '_name', df[col].values, df['name'])

Solution

  • original question

    Cf. first version of the question

    This is a variant on an indexing lookup, you first need to pre-process your input columns a/b to match the column names:

    target = 'a'+df['a'].astype(str)+'_b'+df['b'].astype(str)+'_name'
    
    idx, cols = pd.factorize(target)
    
    out = pd.DataFrame({'index': df['index'],
                        'values': df.reindex(cols, axis=1).to_numpy()
                                  [np.arange(len(df)), idx],
                        })
    
    # or, for a new column in the original DataFrame
    # df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
    

    Output:

       index   values
    0      0   value3
    1      1   value8
    2      2  value16
    3      3  value19
    4      4  value30
    

    Intermediate target:

    0    a1_b2_name
    1    a2_b1_name
    2    a2_b2_name
    3    a1_b1_name
    4    a2_b3_name
    dtype: object
    

    multiple columns:

    One option is to reshape and merge:

    target = 'a'+df['a'].astype(str)+'_b'+df['b'].astype(str)
    
    tmp = df.drop(columns=['index', 'a', 'b'])
    tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)
    
    out = (df
       .reset_index()
       .merge(tmp.stack(level=0), left_on=['index', target], right_index=True)
       .set_index('index')[['name', 'foo']]
    )
    

    Output:

              name      foo
    index                  
    0       value5   value6
    1      value15  value16
    2      value31  value32
    3      value37  value38
    4      value59  value60