Search code examples
pythonpandaspivot-tableanalyticsmulti-index

Categorical column to multiple count columns


Suppose we have a DataFrame:

data = {'person_id': ['person_a', 'person_a', 'person_b','person_b', 'person_c','person_c'],
        'categorical_data': ['new', 'new', 'ok', 'bad', 'new', 'bad']}
df = pd.DataFrame(data)

    person_id   categorical_data
0   person_a    new
1   person_a    new
2   person_b    ok
3   person_b    bad
4   person_c    new
5   person_c    bad

I want to expand the categorical data into multiple columns with counts of each category.

We can group by the person id to get counts:

count_categories = df.groupby('person_id')['categorical_data'].value_counts().reset_index(name='count')

person_id   categorical_data    count
0   person_a    new 2
1   person_b    bad 1
2   person_b    ok  1
3   person_c    bad 1
4   person_c    new 1

Then I tried this to create the new columns:

pivoted = count_categories.set_index(['person_id','categorical_data']).unstack('categorical_data')


count
categorical_data    bad new ok
person_id           
person_a    NaN 2.0 NaN
person_b    1.0 NaN 1.0
person_c    1.0 1.0 NaN

This is the form I want, but I'm confused by the MultiIndexing

How can I get rid of the index, or is there a better way to do this? trying reset index yields:

pivoted.reset_index() 

    person_id   count
categorical_data        bad new ok
0   person_a    NaN 2.0 NaN
1   person_b    1.0 NaN 1.0
2   person_c    1.0 1.0 NaN

Solution

  • Code

    use crosstab

    out = pd.crosstab(df['person_id'], df['categorical_data'])
    

    out

    categorical_data  bad  new  ok
    person_id                     
    person_a            0    2   0
    person_b            1    0   1
    person_c            1    1   0
    

    Or

    out1 = (pd.crosstab(df['person_id'], df['categorical_data'])
              .reset_index()
              .rename_axis(None, axis=1)
    )
    

    out1

        person_id   bad new ok
    0   person_a    0   2   0
    1   person_b    1   0   1
    2   person_c    1   1   0
    

    I don't know your exact desired output, if it's not both out and out1, plz provide it.