I have a dataframe with two columns where each of the two columns contains a list indices. I want to get the product of the two lists on a row by row level to create a multiindex.
For example, df1 below
|---------------------|------------------|
| col_a | col_b |
|---------------------|------------------|
| [A1, A2] | [B1] |
|---------------------|------------------|
| [A3] | [B2, B3] |
|---------------------|------------------|
would turn into this:
MultiIndex([('A1', 'B1'),
('A2', 'B1'),
('A3', 'B2'),
('A3', 'B3')],
names = ['col_a', 'col_b'], length = 4)
What I'm doing right now is the following:
my_multiindex = df1.apply(lambda row: pd.MultiIndex.from_product([row["col_a"], row["col_b"]], names = ['col_a', 'col_b']), axis = 1)
However, the output of this is looks like:
0 MultiIndex([('A1', 'B1'), ('A2', 'B1')])
1 MultiIndex([('A3', 'B2'), ('A3', 'B3')])
One object per row, which makes sense since we are using an apply on axis 1.
Is there any way that I can "concatenate" all the output objects into a single multiindex? Preferably this can be done within the lambda function, but doing it afterwards in a separate step wouldn't be the end of the world.
If you have any suggestions for how to do the overall task (not just the specific "concatenate" step I described) that would be helpful as well. I've already tried doing this in a for loop using iterrows but that is taking way too long as the size of df1 is about 50K rows, the average length of the list in col_a is 300, and the average length of the list in col_b is 30.
Any help would be appreciated!
The easiest way is probably also the "dumbest": just take products of each row, put them all together, and feed them to pd.MultiIndex.from_tuples
.
import itertools
rowwise_products = df.apply(
lambda row: list(itertools.product(*row)),
axis=1
)
all_tuples = rowwise_products.sum() # list concatenation
>>> pd.MultiIndex.from_tuples(all_tuples, names=df.columns)
MultiIndex([('A1', 'B1'),
('A2', 'B1'),
('A3', 'B1'),
('A3', 'B2'),
('A3', 'B3')],
names=['col_a', 'col_b'])