python one-hot-encoding

Why doesn't the loop work in one-hot encoding?


for i in data.columns:
    top_10 = [x for x in data.i.value_counts().sort_values(ascending=False).head(10).index]
    for label in top_10:
        data[label] = np.where(data['i'] == label, 1, 0)
    data[['i'] + top_10]

What is the mistake?


Solution

  • If you want to use the loop variable i from for i in data.columns: then write data[i] (without quotes), not data.i

    for i in data.columns:
    
        top_10 = data[i].value_counts().sort_values(ascending=False).head(10).index
    

    It may be more readable with a descriptive name, e.g. column_name

    for column_name in data.columns:
        
        top_10 = data[column_name].value_counts().sort_values(ascending=False).head(10).index
    

    data.i is equivalent to data["i"]: both mean the column literally named i, not the column whose name is stored in the variable i.
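The difference can be seen in a small sketch. The DataFrame below and its column names ("i", "color") are made up purely for illustration.

```python
import pandas as pd

# Hypothetical frame that happens to contain a column literally named "i"
data = pd.DataFrame({"i": [1, 2], "color": ["red", "blue"]})

i = "color"  # a variable holding a column name, as in `for i in data.columns`

by_variable = data[i]    # uses the variable's value -> column "color"
by_literal = data["i"]   # uses the string "i"       -> column "i"
by_attribute = data.i    # attribute access          -> also column "i"

# Without a column named "i", attribute access has nothing to find
without_i = data.drop(columns="i")
print(hasattr(without_i, "i"))
```

So in the question's loop, data.i never looks at the loop variable at all.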


    I don't know what you are trying to do with the nested for loop, but there you should also use data[i] instead of data["i"]

        for label in top_10:
            data[label] = np.where(data[i]==label, 1, 0)
    

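    But you should probably build the new column names differently. If the raw value is used as the column name, equal values from different columns overwrite each other, so prefix the name with the source column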

        for number, value in enumerate(top_10):
            data[i + '_' + str(number)] = np.where(data[i] == value, 1, 0)
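Why the name of the new column matters can be shown with a small sketch. The frame and its values are invented for illustration; it uses the raw value as the column name, as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: both columns contain the values "red" and "blue"
data = pd.DataFrame({
    "x": ["red", "red", "blue"],
    "y": ["blue", "red", "red"],
})

for column_name in list(data.columns):
    top = data[column_name].value_counts().head(2).index
    for label in top:
        # The column is named after the value only, so the pass over "y"
        # overwrites the "red" and "blue" columns created from "x"
        data[label] = np.where(data[column_name] == label, 1, 0)

print(data)
```

Only four columns remain at the end (x, y, red, blue), and red and blue describe y alone. Names that include the source column avoid this.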
    

    It is more readable with descriptive names

    for column_name in data.columns:
        
        top_10 = data[column_name].value_counts().sort_values(ascending=False).head(10).index
    
        for number, value in enumerate(top_10):
            data[column_name + '_' + str(number)] = np.where(data[column_name] == value, 1, 0)
    

    But without example data it is hard to say whether this does what you want.


    EDIT:

    Minimal working example.

    I use random.seed(0) so that the values are the same on every run.

    I use top_3 instead of top_10 so that all the columns fit on screen.

    import pandas as pd
    import random
    import numpy as np
    
    random.seed(0) #  to get the same values every time
    
    data = pd.DataFrame({
        "A": [random.randint(0, 10) for _ in range(10)],
        "B": [random.randint(0, 10) for _ in range(10)],
    })
    
    #print(data)
    
    for column_name in data.columns:
        #print(data[column_name].value_counts())
        top_3 = data[column_name].value_counts().sort_values(ascending=False).head(3).index
        #print(top_3)
        for number, value in enumerate(top_3, 1):
            name = column_name + '_' + str(number)
            data[name] = np.where(data[column_name]==value, 1, 0)
            
    print(data)   
    

    Result:

       A  B  A_1  A_2  A_3  B_1  B_2  B_3
    0  6  9    1    0    0    0    0    0
    1  6  3    1    0    0    0    0    0
    2  0  8    0    0    0    0    0    1
    3  4  2    0    1    0    1    0    0
    4  8  4    0    0    0    0    1    0
    5  7  2    0    0    1    1    0    0
    6  6  1    1    0    0    0    0    0
    7  4  9    0    1    0    0    0    0
    8  7  4    0    0    1    0    1    0
    9  5  8    0    0    0    0    0    1
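As a follow-up sketch, the same loop can be wrapped in a function that returns a new DataFrame instead of modifying data in place. The function name one_hot_top_n and the sample values are hypothetical, and the values are chosen so that there are no ties among the top entries. It relies on value_counts() already sorting by descending count, and it uses a comparison with .astype(int) in place of np.where, which gives the same 0/1 columns.

```python
import pandas as pd


def one_hot_top_n(frame, n):
    """Return a copy of frame with 0/1 columns for the n most frequent values of each column."""
    result = frame.copy()
    for column_name in frame.columns:
        # value_counts() sorts by count, most frequent first
        top_values = frame[column_name].value_counts().head(n).index
        for number, value in enumerate(top_values, 1):
            new_name = f"{column_name}_{number}"
            result[new_name] = (frame[column_name] == value).astype(int)
    return result


# Hypothetical sample data without ties among the top 2 values
data = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y", "z"],
    "B": ["p", "q", "q", "q", "p", "r"],
})

encoded = one_hot_top_n(data, n=2)
print(encoded)
```

Because the function works on a copy, the original data keeps only its own columns and the function can be called again with a different n.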