I have a pandas DataFrame as follows:
import pandas as pd
import numpy as np
data = {"first_column": ["item1", "item2", "item3", "item4", "item5", "item6", "item7"],
"second_column": ["cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"],
"third_column": [5, 1, 8, 3, 731, 189, 9]}
df = pd.DataFrame(data)
df
   first_column  second_column  third_column
0         item1           cat1             5
1         item2           cat1             1
2         item3           cat1             8
3         item4           cat2             3
4         item5           cat2           731
5         item6           cat2           189
6         item7           cat2             9
Now, let's say I wanted to create a fourth column that classifies the values in the third column using pandas.cut(). Here, I label each row according to whether the element in third_column is less than or equal to ten (<=10).
df["less_than_ten"]= pd.cut(df.third_column, [-np.inf, 10, np.inf], labels=(1,0))
And the resulting dataframe is now:
   first_column  second_column  third_column  less_than_ten
0         item1           cat1             5              1
1         item2           cat1             1              1
2         item3           cat1             8              1
3         item4           cat2             3              1
4         item5           cat2           731              0
5         item6           cat2           189              0
6         item7           cat2             9              1
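As a side note on why the call above labels values up to and including 10 with a 1: pandas.cut() builds right-closed bins by default, so the edges [-np.inf, 10, np.inf] become (-inf, 10] and (10, inf]. The following is a minimal sketch on a small made-up Series (the values and variable names are illustrative, not taken from the DataFrame above):

```python
import numpy as np
import pandas as pd

# Illustrative values chosen to sit on and around the bin edge at 10.
values = pd.Series([5, 10, 11, 731])
bins = [-np.inf, 10, np.inf]

# Without labels, pd.cut returns the interval each value falls into.
intervals = pd.cut(values, bins)

# With labels, the first bin (-inf, 10] maps to 1 and the second (10, inf] to 0.
labels = pd.cut(values, bins, labels=[1, 0])

# right=False makes the bins left-closed instead, so 10 moves to the second bin.
left_closed = pd.cut(values, bins, labels=[1, 0], right=False)

print(intervals)
print(labels.tolist())
print(left_closed.tolist())
```

The only difference between the last two calls is which side of each bin is closed, which matters when a value sits exactly on an edge.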
Question: Notice the second column, second_column, with categories cat1 and cat2. How would I use pandas.cut() to reclassify these values based on the "class" in second_column?
More importantly, let's say I wanted more complex intervals, e.g. less than or equal to 500, le(500), and greater than or equal to 20, ge(20). How would this be done? In this case, the 1 labels should be assigned by group:
   first_column  second_column  third_column  less_than_ten
0         item1           cat1             5              1
1         item2           cat1             1              1
2         item3           cat1             8              1
3         item4           cat2             3              1
4         item5           cat2           731              0
5         item6           cat2           189              1
6         item7           cat2             9              1
I wouldn't use pd.cut in this case:
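# Default rule for every row: 1 if third_column <= 10, else 0
df['less_than_ten'] = df.third_column.le(10).astype(np.uint8)
# For cat2 rows, shift the labels by 2 so they become 3 (<= 10) or 2 (> 10)
df.loc[df.second_column=='cat2','less_than_ten'] = \
    df.loc[df.second_column=='cat2','third_column'].le(10).astype(np.uint8) + 2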
Result:
In [99]: df
Out[99]:
   first_column  second_column  third_column  less_than_ten
0         item1           cat1             5              1
1         item2           cat1             1              1
2         item3           cat1             8              1
3         item4           cat2             3              3
4         item5           cat2           731              2
5         item6           cat2           189              2
6         item7           cat2             9              3
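If the goal is instead a single 0/1 flag whose interval depends on the category, one possible sketch is to map each category to its own bounds and compare row by row. This assumes the rule implied by the second table in the question (cat1 flagged when <= 10, cat2 flagged when <= 500); the bounds dictionary and the in_range column name are hypothetical and only there for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_column": ["item1", "item2", "item3", "item4", "item5", "item6", "item7"],
    "second_column": ["cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"],
    "third_column": [5, 1, 8, 3, 731, 189, 9],
})

# Hypothetical closed interval [low, high] per category (an assumption, not
# something stated in the question); edit these to match your own rules.
bounds = {"cat1": (-np.inf, 10), "cat2": (-np.inf, 500)}

# Look up the lower and upper bound that applies to each row.
low = df["second_column"].map(lambda cat: bounds[cat][0])
high = df["second_column"].map(lambda cat: bounds[cat][1])

# 1 where low <= third_column <= high for that row's category, else 0.
in_range = df["third_column"].ge(low) & df["third_column"].le(high)
df["in_range"] = in_range.astype(np.uint8)

print(df)
```

Setting bounds["cat2"] to (20, 500) would apply the ge(20) and le(500) conditions together for cat2; note that rows 3 and 6 (values 3 and 9) would then be flagged 0, which differs from the table in the question.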