python-3.x vectorization categorical-data

Response coding for categorical data

Response coding is a technique for vectorizing categorical data. Suppose we have a categorical feature named 'grade_category' with the following unique labels: ['grades_3_5', 'grades_prek_2', 'grades_9_12', 'grades_6_8']. Assume that we are working on a binary classification problem with target class labels 0 and 1.

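In response coding, each label of the feature is replaced by the probabilities that it occurs with each class label. Each probability is the number of training rows with that label and that class, divided by the total number of training rows with that label. E.g., grades_prek_2 = [probability it occurs with class 0, probability it occurs with class 1].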
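To make the definition concrete, here is a minimal sketch on a made-up toy dataset. The rows and the column name `target` are assumptions for illustration only. It uses `pandas.crosstab` with `normalize="index"`, which gives, for each label, the fraction of its rows that fall in each class.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "grade_category": ["grades_prek_2", "grades_prek_2", "grades_prek_2",
                       "grades_3_5", "grades_3_5", "grades_6_8"],
    "target": [1, 1, 0, 0, 1, 0],
})

# Rows: feature labels; columns: class labels 0 and 1; each row sums to 1
probs = pd.crosstab(toy["grade_category"], toy["target"], normalize="index")

# Each label maps to [P(class 0 | label), P(class 1 | label)]
encoding = {label: row.tolist() for label, row in probs.iterrows()}
print(encoding)
```

Here grades_prek_2 appears three times, once with class 0 and twice with class 1, so it is encoded as roughly [0.33, 0.67].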

Solution

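from prettytable import PrettyTable
from tqdm import tqdm


def response_coding(xtrain, ytrain, feature):
    """Encode a categorical feature using the response coding technique.

    args:
        xtrain (DataFrame): training data containing the column `feature`
        ytrain (ndarray or Series aligned with xtrain): class labels (0 or 1)
        feature (str): name of the categorical column to encode
    returns:
        dictionary (dict): maps each label to [P(class 1), P(class 0)]
    """
    dictionary = dict()
    # Table used only to display the resulting encoding
    x = PrettyTable([feature, 'class 1', 'class 0'])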

    unique_cat_labels = xtrain[feature].unique()

    for label in tqdm(unique_cat_labels):
        # Boolean mask selecting the rows that carry this label
        is_label = xtrain[feature] == label
        total_count = is_label.sum()
        count_0 = (is_label & (ytrain == 0)).sum()
        count_1 = (is_label & (ytrain == 1)).sum()

        # Order is [P(class 1 | label), P(class 0 | label)]
        dictionary[label] = [count_1/total_count, count_0/total_count]
        x.add_row([label, count_1/total_count, count_0/total_count])
    print()
    print(x)
    return dictionary
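
The dictionary is built from the training data, so the same mapping has to be reused when encoding validation or test data, where labels that never appeared in training may show up. The sketch below shows one way to do that. The helper `apply_response_coding`, the example values, and the `[0.5, 0.5]` fallback for unseen labels are all assumptions for illustration, not part of the function above.

```python
import numpy as np
import pandas as pd


def apply_response_coding(df, feature, encoding, default=(0.5, 0.5)):
    """Hypothetical helper: turn a label column into an (n_rows, 2) array.

    `encoding` has the shape returned by response_coding:
    {label: [P(class 1), P(class 0)]}. Labels missing from `encoding`
    fall back to `default` (an assumed choice for unseen labels).
    """
    return np.array([encoding.get(label, list(default)) for label in df[feature]])


# Illustrative values only, not computed from real data
encoding = {"grades_3_5": [0.8, 0.2], "grades_6_8": [0.6, 0.4]}
xtest = pd.DataFrame({"grade_category": ["grades_6_8", "grades_9_12", "grades_3_5"]})

# 'grades_9_12' is not in the dictionary, so it receives the default
encoded = apply_response_coding(xtest, "grade_category", encoding)
print(encoded)
```

The result has one row per sample and two columns, which can be stacked with the other numeric features before training a model.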