Let's say I have this dataframe
id | car | sex | income
-------------------------------
1 | European | Male | 45000
2 | Japanese | Female | 48000
3 | American | Male | 53000
Is there an easy way to create this one (using pandas) ?
| id | car | choice | sex | income
1.| 1 | European | 1 | Male | 45000
2.| 1 | American | 0 | Male | 45000
3.| 1 | Japanese | 0 | Male | 45000
| ----------------------------------------
4.| 2 | European | 0 | Female | 48000
5.| 2 | American | 0 | Female | 48000
6.| 2 | Japanese | 1 | Female | 48000
| ----------------------------------------
7.| 3 | European | 0 | Male | 53000
8.| 3 | American | 1 | Male | 53000
9.| 3 | Japanese | 0 | Male | 53000
The idea is to get every modality of the "car" variable in the base, and to indicate which one every person has chosen through the "choice" variable. For example, in the first table, the person with the id
1 has chosen an European car, so choice
equals 1 in the line where car
is European, and zero in the line where car
equals American or Japanese.
I already coded something which manually does it (using a dictionary) but I wanted to know if a cleaner solution exists.
(This has for aim to format the data in order to use an asclogit/nlogit under Stata)
I believe need:
df = df.assign(choice = 1).set_index(['id','car'])
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names=df.index.names))
df = (df.assign(choice=df['choice'].fillna(0).astype(int))
.groupby(level=0).apply(lambda x: x.ffill().bfill())
.reset_index())
print (df)
id car sex income choice
0 1 American Male 45000.0 0
1 1 European Male 45000.0 1
2 1 Japanese Male 45000.0 0
3 2 American Female 48000.0 0
4 2 European Female 48000.0 0
5 2 Japanese Female 48000.0 1
6 3 American Male 53000.0 1
7 3 European Male 53000.0 0
8 3 Japanese Male 53000.0 0
Explanation:
1.First create new column with 1
by assign
2.Create MultiIndex
by set_index
3.Generate all possible combination by MultiIndex.from_product
and create new rows by reindex
4.Then replace NaN
s to 0
in choice
column
5.Last replace NaN
s in all another columns by forward and backfilling.