Search code examples
pythonpandasdataframepandas-groupbymulti-index

Converting Pandas groupby to a dataframe having columns with boolean values


My input data is this-

    "Name" : ["Alice", "Bob", "Jake", "Jake", "Steve" , "Bob"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    "CityID" : ["1", "1", "2", "1", "1", "2"]
 } ) 

Which when printed appears as this:

   City         Name  CityID
0   Seattle    Alice       1
1   Seattle      Bob       1
2  Portland     Jake       2 
3   Seattle     Jake       1
4   Seattle    Steve       1
5  Portland      Bob       2

I am trying to get something like this -

       City CityID  Alice   Bob  Jake  Steve
0   Seattle      1   True  True  True   True
1  Portland      2  False  True  True  False

I am not sure how to get here. I did try the groupby function

df.groupby(['City','CityID', 'Name']).first()

which got me to this-

    City    CityID     Name
 Seattle         1    Alice
                        Bob
                       Jake   
Portland         2      Bob
                       Jake
                      Steve    

I am not sure how to proceed after this, any hints would be of great help.


Solution

  • Use get_dummies with dtype=bool parameter with unique values in MultiIndex by max:

    df = (pd.get_dummies(df.set_index(['City','CityID'])['Name'], dtype=bool)
            .max(level=[0,1])
            .reset_index())
    print (df)
           City CityID  Alice   Bob  Jake  Steve
    0   Seattle      1   True  True  True   True
    1  Portland      2  False  True  True  False
    

    If need groupby solution create new column filled by True with DataFrame.assign, use GroupBy.first, reshape by Series.unstack and last convert MultiIndex to columns by DataFrame.reset_index, DataFrame.rename_axis is for remove column name a:

    df = (df.assign(a=True)
            .groupby(['City','CityID', 'Name'], sort=False)['a']
            .first()
            .unstack(fill_value=False)
            .reset_index()
            .rename_axis(None, axis=1))
    print (df)
           City CityID  Alice   Bob  Jake  Steve
    0   Seattle      1   True  True  True   True
    1  Portland      2  False  True  True  False