I have the following example df:
year school occupation a b c
2020 GA Tech engineer 2 3 4
2020 GA Tech Business 1 5 4
.
.
.
There are many of the same occupations. I want the following dataframe:
year school engineer_a engineer_ b engineer_c Business_a Business_b Businesss_c
2020 GA Tech 2 3 4 1 5 4
where it is transformed into one big row for that year. How would I do this?
I have tried pivot tables and I couldn't figure out a way to make it work.
You could try to use .pivot
in the following way:
df = (df.pivot(index=["year", "school"], columns="occupation")
.sort_index(level=1, axis=1))
df.columns = [f"{b}_{a}" for a, b in df.columns]
df = df.reset_index()
Result for the short sample:
year school Business_a Business_b Business_c engineer_a engineer_b engineer_c
0 2020 GA Tech 1 5 4 2 3 4
If your comment "... there can be many of the same occupation" means that df
could look like
year school occupation a b c
0 2020 GA Tech engineer 2 3 4
1 2020 GA Tech engineer 20 30 40
2 2020 GA Tech Business 1 5 4
3 2020 GA Tech Business 10 50 40
4 2020 GA Tech Business 100 500 400
then try
df = (
df.assign(count=df.groupby(["year", "school", "occupation"]).cumcount())
.pivot(index=["count", "year", "school"], columns="occupation")
.sort_index(level=1, axis=1)
)
df.columns = [f"{b}_{a}" for a, b in df.columns]
df = df.reset_index().drop(columns="count")
to get
year school Business_a Business_b Business_c engineer_a engineer_b engineer_c
0 2020 GA Tech 1.0 5.0 4.0 2.0 3.0 4.0
1 2020 GA Tech 10.0 50.0 40.0 20.0 30.0 40.0
2 2020 GA Tech 100.0 500.0 400.0 NaN NaN NaN