I have a dataframe with two columns, one for names and one for string values. I'm trying to count frequency of select string values by names.
I've tried pandas.pivot_table and pandas.DataFrame.groupby but I'd like to create a whole new dataframe rather than aggregation.
For example, I have a dataframe:
import pandas as pd
import numpy as np
data = np.array([['John', 'x'], ['John', 'x'], ['John', 'x'], ['John', 'y'], ['John', 'y'], ['John', 'a'],
['Will', 'x'], ['Will', 'z']])
df = pd.DataFrame(data, columns=['name','str_value'])
df
which results in:
name str_value
0 John x
1 John x
2 John x
3 John y
4 John y
5 John a
6 Will x
7 Will z
An expected result would be:
name x y z
0 John 3 2 0
1 Will 1 0 1
and additionally:
name x y z
0 John True True False
1 Will True False True
I'd like to select x, y, z only and return True or False based on whether the returned value is 0 or NaN.
Edit: Thank you for the answers. These work great, but the output has the subgroup "str_value":
str_value x y z
name
John True True False
Will True False True
Is there a way to remove this so I have "name", "x", "y", "z" on the same level? With .reset_index() I get:
str_value name x y z
0 John True True False
1 Will True False True
Is the name of my index "str_value" now? Can I rename or drop this?
You can try:
df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']].gt(0)
Explanations:
print(df.groupby(["name", "str_value"]).size())
# John a 1
# x 3
# y 2
# Will x 1
# z 1
# dtype: int64
unstack
print(df.groupby(["name", "str_value"]).size().unstack())
# str_value a x y z
# name
# John 1.0 3.0 2.0 NaN
# Will NaN 1.0 NaN 1.0
print(df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']])
# str_value x y z
# name
# John 3.0 2.0 NaN
# Will 1.0 NaN 1.0
gt
:result = df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']].gt(0)
print(result)
# str_value x y z
# name
# John True True False
# Will True False True