python-3.x pandas dataframe categories categorical-data

Comparing categorical variables between columns in pandas.DataFrame

How do I make this comparison using the set categorical rules instead of the lexicon order rule?

Given the dataset:

df = pd.DataFrame({
    'NUMBER':[12, 26, 16, 34, 38, 1, 26, 8],
    'SHIRT_SIZE':['S', 'M', 'XL', 'L', 'S', 'M', 'L', 'XL'],
    'SHIRT_SIZE2':['M', 'S', 'L', 'XL', 'M', 'L', 'XL', 'S']
})
from pandas.api.types import CategoricalDtype
c_dtype = CategoricalDtype(categories = ["S","M","L","XL"],ordered = True)
df['SHIRT_SIZE'] = df['SHIRT_SIZE'].astype(c_dtype)
df['SHIRT_SIZE2'] = df['SHIRT_SIZE2'].astype(c_dtype)

	NUMBER	SHIRT_SIZE	SHIRT_SIZE2
0	12	S	M
1	26	M	S
2	16	XL	L
3	34	L	XL
4	38	S	M
5	1	M	L
6	26	L	XL
7	8	XL	S

The dtype of 'SHIRT_SIZE' and 'SHIRT_SIZE2' is Categories (4, object): ['S' < 'M' < 'L' < 'XL'] I would like to compare the shirt sizes between the two columns 'SHIRT_SIZE' and 'SHIRT_SIZE2'

I attempted:

def compare_size(row):
    if (row['SHIRT_SIZE'] < row['SHIRT_SIZE2']):
        return 'SMALLER'
    elif (row['SHIRT_SIZE'] > row['SHIRT_SIZE2']):
        return 'LARGER'
    else:
        return 'SAME'

df['COMPARE_SIZE'] = df.apply(lambda row: compare_size(row), axis=1)

Resulting in:

	NUMBER	SHIRT_SIZE	SHIRT_SIZE2	COMPARE_SIZE
0	12	S	M	LARGER
1	26	M	S	SMALLER
2	16	XL	L	LARGER
3	34	L	XL	SMALLER
4	38	S	M	LARGER
5	1	M	L	LARGER
6	26	L	XL	SMALLER
7	8	XL	S	LARGER

Notice that there are some rows e.g. row 0 where 'S' -> 'M' and row 1 where 'M' -> 'S' do not follow the order of our categorical dtype rules

Logically, the interpretation is: "SHIRT_SIZE is <COMPARE_SIZE> THAN SHIRT_SIZE2"

I am guessing that the lexicon order of the string is the underlying rule used to compare these shirt sizes rather than the categorical rule we have set where Categories (4, object): ['S' < 'M' < 'L' < 'XL'].

I hope to compare the shirt sizes according to the categorical order.

Solution

Use numpy select to compare the values and genereate your new column:

condlist = [df.SHIRT_SIZE.gt(df.SHIRT_SIZE2), df.SHIRT_SIZE.lt(df.SHIRT_SIZE2)]
result_list = ["LARGER", "SMALLER"]
compare_size = np.select(condlist, result_list, "SAME")
df.assign(compare_size=compare_size)


    NUMBER  SHIRT_SIZE  SHIRT_SIZE2     compare_size
0   12  S   M   SMALLER
1   26  M   S   LARGER
2   16  XL  L   LARGER
3   34  L   XL  SMALLER
4   38  S   M   SMALLER
5   1   M   L   SMALLER
6   26  L   XL  SMALLER
7   8   XL  S   LARGER