Search code examples
pythonpandasdataframepivot

How to expand a list of features based off one column in pandas?


I have the following example df:

year   school    occupation     a      b     c
2020   GA Tech     engineer     2     3      4
2020  GA Tech      Business     1     5      4
.
.
.

There are many of the same occupations. I want the following dataframe:

year   school       engineer_a     engineer_ b     engineer_c Business_a  Business_b Businesss_c
2020   GA Tech       2                3                 4         1         5             4

where it is transformed into one big row for that year. How would I do this?

I have tried pivot tables and I couldn't figure out a way to make it work.


Solution

  • You could try to use .pivot in the following way:

    df = (df.pivot(index=["year", "school"], columns="occupation")
          .sort_index(level=1, axis=1))
    df.columns = [f"{b}_{a}" for a, b in df.columns]
    df = df.reset_index()
    

    Result for the short sample:

       year   school  Business_a  Business_b  Business_c  engineer_a  engineer_b  engineer_c
    0  2020  GA Tech           1           5           4           2           3           4
    

    If your comment "... there can be many of the same occupation" means that df could look like

       year   school occupation    a    b    c
    0  2020  GA Tech   engineer    2    3    4
    1  2020  GA Tech   engineer   20   30   40
    2  2020  GA Tech   Business    1    5    4
    3  2020  GA Tech   Business   10   50   40
    4  2020  GA Tech   Business  100  500  400
    

    then try

    df = (
        df.assign(count=df.groupby(["year", "school", "occupation"]).cumcount())
        .pivot(index=["count", "year", "school"], columns="occupation")
        .sort_index(level=1, axis=1)
    )
    df.columns = [f"{b}_{a}" for a, b in df.columns]
    df = df.reset_index().drop(columns="count")
    

    to get

       year   school  Business_a  Business_b  Business_c  engineer_a  engineer_b  engineer_c
    0  2020  GA Tech         1.0         5.0         4.0         2.0         3.0         4.0
    1  2020  GA Tech        10.0        50.0        40.0        20.0        30.0        40.0
    2  2020  GA Tech       100.0       500.0       400.0         NaN         NaN         NaN