How do I make this comparison using the set categorical rules instead of the lexicon order rule?
Given the dataset:
df = pd.DataFrame({
'NUMBER':[12, 26, 16, 34, 38, 1, 26, 8],
'SHIRT_SIZE':['S', 'M', 'XL', 'L', 'S', 'M', 'L', 'XL'],
'SHIRT_SIZE2':['M', 'S', 'L', 'XL', 'M', 'L', 'XL', 'S']
})
from pandas.api.types import CategoricalDtype
c_dtype = CategoricalDtype(categories = ["S","M","L","XL"],ordered = True)
df['SHIRT_SIZE'] = df['SHIRT_SIZE'].astype(c_dtype)
df['SHIRT_SIZE2'] = df['SHIRT_SIZE2'].astype(c_dtype)
NUMBER | SHIRT_SIZE | SHIRT_SIZE2 | |
---|---|---|---|
0 | 12 | S | M |
1 | 26 | M | S |
2 | 16 | XL | L |
3 | 34 | L | XL |
4 | 38 | S | M |
5 | 1 | M | L |
6 | 26 | L | XL |
7 | 8 | XL | S |
The dtype
of 'SHIRT_SIZE'
and 'SHIRT_SIZE2'
is Categories (4, object): ['S' < 'M' < 'L' < 'XL']
I would like to compare the shirt sizes between the two columns 'SHIRT_SIZE'
and 'SHIRT_SIZE2'
I attempted:
def compare_size(row):
if (row['SHIRT_SIZE'] < row['SHIRT_SIZE2']):
return 'SMALLER'
elif (row['SHIRT_SIZE'] > row['SHIRT_SIZE2']):
return 'LARGER'
else:
return 'SAME'
df['COMPARE_SIZE'] = df.apply(lambda row: compare_size(row), axis=1)
Resulting in:
NUMBER | SHIRT_SIZE | SHIRT_SIZE2 | COMPARE_SIZE | |
---|---|---|---|---|
0 | 12 | S | M | LARGER |
1 | 26 | M | S | SMALLER |
2 | 16 | XL | L | LARGER |
3 | 34 | L | XL | SMALLER |
4 | 38 | S | M | LARGER |
5 | 1 | M | L | LARGER |
6 | 26 | L | XL | SMALLER |
7 | 8 | XL | S | LARGER |
Notice that there are some rows e.g. row 0 where 'S' -> 'M' and row 1 where 'M' -> 'S' do not follow the order of our categorical dtype
rules
Logically, the interpretation is: "SHIRT_SIZE is <COMPARE_SIZE> THAN SHIRT_SIZE2"
I am guessing that the lexicon order of the string is the underlying rule used to compare these shirt sizes rather than the categorical rule we have set where Categories (4, object): ['S' < 'M' < 'L' < 'XL']
.
I hope to compare the shirt sizes according to the categorical order.
Use numpy select to compare the values and genereate your new column:
condlist = [df.SHIRT_SIZE.gt(df.SHIRT_SIZE2), df.SHIRT_SIZE.lt(df.SHIRT_SIZE2)]
result_list = ["LARGER", "SMALLER"]
compare_size = np.select(condlist, result_list, "SAME")
df.assign(compare_size=compare_size)
NUMBER SHIRT_SIZE SHIRT_SIZE2 compare_size
0 12 S M SMALLER
1 26 M S LARGER
2 16 XL L LARGER
3 34 L XL SMALLER
4 38 S M SMALLER
5 1 M L SMALLER
6 26 L XL SMALLER
7 8 XL S LARGER