Aim: Get a summary table of counts of categories that are nested within another category.
Example: I have this dataframe:
# initialize data of lists.
data = {'Name': ['Tom', 'Tom', 'Tom', 'jack', 'jack', 'Mary', 'Mary', 'Mary', 'Jim', 'Jim'],
'CEFR_level': ['A1', 'A2', 'PreA1', 'A2', 'A1','A1','B1','C1', 'A1', 'B1']}
# Create DataFrame
df = pd.DataFrame(data)
# I use this code to recode:
Easy = ["PreA1","A1", "A2"]
Medium =["B1", "B2"]
Hard = ["C1", "C2"]
data ['CEFR_categories'] = np.where(data['CEFR_level'].isin(Easy), 'Easy',
np.where(data['CEFR_level'].isin(Medium), 'Medium',
np.where(data ['CEFR_level'].isin(Hard), 'Hard', 'Other')))
I managed to make the column: data ['CEFR_categories'] and categorize properly the Easy, Medium and Hard.
My problem is with the groupby.
Task: DONE I want to recode the X, Y and Z into easy, medium and hard.
I then I want to groupby combining categories. For example, the new easy category, repeats 2 times (Tom (who has in the CEFR_level 'A1', 'A2', 'PreA1') and Jack (who has A1 and A2 in the CEFR level)). Easy-Medium-Hard (repeats 1 time for Mary (which a different combination of CEFR_level which is then recoded to easy, medium and hard)) and Easy-Medium repeats 1 time for Jim.
I have for hours tried to recode this, I can recode into another column but I only have 1 category (for example) easy for the first row. (With the code above)
My output should look like this:
How can i group by this.
Thanks for your help
EDIT and UPDATE
I used @Timeless answer I got the following output:
Any suggestions? The first 4 lines in my real data has a cat1 of: easy, easy, easy, medium. Which would result in a Easy-Medium.
But the ouput says there is none.
Final answer
This code by timeless also works.
cats = sorted(testlet_item_bank["CEFR_categories"].unique())
#status = dict(zip(cats, ["Easy", "Medium", "Hard"])) # this was mixing categories
ps = list(map("-".join, powerset(cats)))[1:]
out = (
testlet_item_bank # the first chain can be optional
.astype({"CEFR_categories": pd.CategoricalDtype(cats, ordered=True)})
.groupby("TestletID")["CEFR_categories"]
.agg(lambda x: "-".join(pd.unique(x.sort_values())))
.value_counts()
.reindex(ps, fill_value=0)
.rename_axis("Categories")
.reset_index(name="Counts")
# .replace(status, regex=True) this mixes categories
)
With a value_counts
/powerset
:
from more_itertools import powerset
mapper = {
"Easy": ["PreA1", "A1", "A2"],
"Medium": ["B1", "B2"],
"Hard": ["C1", "C2"],
}
status = {v: k for k,lst_v in mapper.items() for v in lst_v}
df["CEFR_level"] = (
df["CEFR_level"].map(status).fillna("Other")
.astype(pd.CategoricalDtype(list(mapper) + ["Other"], ordered=True))
)
ps = list(map("-".join, powerset(mapper)))[1:]
out = (
df # the first chain can be optional
.groupby("Name")["CEFR_level"]
.agg(lambda x: "-".join(pd.unique(x.sort_values())))
.value_counts()
.reindex(ps, fill_value=0)
.rename_axis("Categories")
.reset_index(name="Counts")
)
NB : If you can't install more_itertools
, you can use this recipe from the docs.
Output :
Categories Counts
0 Easy 2
1 Medium 0
2 Hard 0
3 Easy-Medium 1
4 Easy-Hard 0
5 Medium-Hard 0
6 Easy-Medium-Hard 1
[7 rows x 2 columns]