Vectorization of column search by dynamic value in pandas

I am starting to learn Pandas. And which day I can not solve the fastest way to calculate. How to get for each row the value of a column by a unique name, composed of column 'a', 'b' values?

Below is an example of the initial data.

index	a	b	a1_b1_name	a1_b1_foo_bar	a2_b1_name	a2_b1_foo_bar	a1_b2_name	a1_b2_foo_bar	a2_b2_name	a2_b2_foo_bar	a1_b3_name	a1_b3_foo_bar	a2_b3_name	a2_b3_foo_bar
0	1	2	value1	value2	value3	value4	value5	value6	value7	value8	value9	value10	value11	value12
1	2	1	value13	value14	value15	value16	value17	value18	value19	value20	value21	value22	value23	value24
2	2	2	value25	value26	value27	value28	value29	value30	value31	value32	value33	value34	value35	value36
3	1	1	value37	value38	value39	value40	value41	value42	value43	value44	value45	value46	value47	value48
4	2	3	value49	value50	value51	value52	value53	value54	value55	value56	value57	value58	value59	value60

The number of columns with the values "a _b _name " is planned to be much larger, about 40. The number of rows will be in the tens of thousands.

I need to create a new column 'name' based on the data of the table as quickly as possible and preferably without loops, using the power of pandas vectorization.

Like this one:

index	name	foo_bar
0	value5	value6
1	value15	value16
2	value31	value32
3	value37	value38
4	value59	value60

I was only able to do this by looping through the columns. But it takes more time than I'd like:

for col in df.columns:
     df['name'] = np.where(col == 'a' + (df['a'].astype('Int16').astype(str)) + '_b' + (df['b'].astype('Int16').astype(str)) + '_name', df[col].values, df['name'])

Solution

original question

Cf. first version of the question

This is a variant on an indexing lookup, you first need to pre-process your input columns a/b to match the column names:

target = 'a'+df['a'].astype(str)+'_b'+df['b'].astype(str)+'_name'

idx, cols = pd.factorize(target)

out = pd.DataFrame({'index': df['index'],
                    'values': df.reindex(cols, axis=1).to_numpy()
                              [np.arange(len(df)), idx],
                    })

# or, for a new column in the original DataFrame
# df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Output:

   index   values
0      0   value3
1      1   value8
2      2  value16
3      3  value19
4      4  value30

Intermediate target:

0    a1_b2_name
1    a2_b1_name
2    a2_b2_name
3    a1_b1_name
4    a2_b3_name
dtype: object

multiple columns:

One option is to reshape and merge:

target = 'a'+df['a'].astype(str)+'_b'+df['b'].astype(str)

tmp = df.drop(columns=['index', 'a', 'b'])
tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)

out = (df
   .reset_index()
   .merge(tmp.stack(level=0), left_on=['index', target], right_index=True)
   .set_index('index')[['name', 'foo']]
)

Output:

          name      foo
index                  
0       value5   value6
1      value15  value16
2      value31  value32
3      value37  value38
4      value59  value60