Search code examples
pythonpandasstata

Pandas - Create a table with a "dummy variable"


Let's say I have this dataframe

id | car      | sex    | income
-------------------------------
1  | European | Male   | 45000
2  | Japanese | Female | 48000
3  | American | Male   | 53000

Is there an easy way to create this one (using pandas) ?

  | id | car      | choice | sex    | income
1.| 1  | European | 1      | Male   | 45000
2.| 1  | American | 0      | Male   | 45000
3.| 1  | Japanese | 0      | Male   | 45000
  | ----------------------------------------
4.| 2  | European | 0      | Female | 48000
5.| 2  | American | 0      | Female | 48000
6.| 2  | Japanese | 1      | Female | 48000
  | ----------------------------------------
7.| 3  | European | 0      | Male   | 53000
8.| 3  | American | 1      | Male   | 53000
9.| 3  | Japanese | 0      | Male   | 53000

The idea is to get every modality of the "car" variable in the base, and to indicate which one every person has chosen through the "choice" variable. For example, in the first table, the person with the id 1 has chosen an European car, so choice equals 1 in the line where car is European, and zero in the line where car equals American or Japanese.

I already coded something which manually does it (using a dictionary) but I wanted to know if a cleaner solution exists.

(This has for aim to format the data in order to use an asclogit/nlogit under Stata)


Solution

  • I believe need:

    df = df.assign(choice = 1).set_index(['id','car'])
    df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names=df.index.names))
    
    df = (df.assign(choice=df['choice'].fillna(0).astype(int))
            .groupby(level=0).apply(lambda x: x.ffill().bfill())
            .reset_index())
    print (df)
       id       car     sex   income  choice
    0   1  American    Male  45000.0       0
    1   1  European    Male  45000.0       1
    2   1  Japanese    Male  45000.0       0
    3   2  American  Female  48000.0       0
    4   2  European  Female  48000.0       0
    5   2  Japanese  Female  48000.0       1
    6   3  American    Male  53000.0       1
    7   3  European    Male  53000.0       0
    8   3  Japanese    Male  53000.0       0
    

    Explanation:

    1.First create new column with 1 by assign
    2.Create MultiIndex by set_index
    3.Generate all possible combination by MultiIndex.from_product and create new rows by reindex
    4.Then replace NaNs to 0 in choice column
    5.Last replace NaNs in all another columns by forward and backfilling.