Why doesn't the loop work in one-hot encoding?

for i in data.columns:
    top_10 = [x for x in data.i.value_counts().sort_values(ascending=False).head(10).index]
    for label in top_10:
        data[label] = np.where(data['i'] == label, 1, 0)
    data[['i'] + top_10]

What is the mistake?

Solution

If you want to use the variable i from for i in data.columns: then you should use data[i] (without quotes) instead of data.i

for i in data.columns:

    top_10 = data[i].value_counts().sort_values(ascending=False).head(10).index

It may be more readable with a more descriptive name, e.g. column_name

for column_name in data.columns:
    
    top_10 = data[column_name].value_counts().sort_values(ascending=False).head(10).index

data.i works like data["i"]: both refer to a column literally named i, not to the variable i.
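
The difference can be checked with a small sketch. The frame and its "color" column are made up for illustration (they are not from the question); the point is only that data.i and data["i"] look for a column named i, while data[i] uses the value stored in the variable.

```python
import pandas as pd

# Hypothetical frame: note there is NO column literally named "i".
data = pd.DataFrame({"color": ["red", "blue", "red"]})

errors = []
values = None

for i in data.columns:          # i holds the string "color"
    try:
        data.i                  # attribute access: looks for a column named "i"
    except AttributeError:
        errors.append("AttributeError")

    try:
        data["i"]               # quoted key: also looks for a column named "i"
    except KeyError:
        errors.append("KeyError")

    values = data[i].tolist()   # variable as key: selects the "color" column

print(errors)
print(values)
```

So inside the loop only data[i] follows the loop variable; the other two forms fail unless the frame really has a column called i.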

I don't know what you are trying to do with the nested for-loop, but there you should also use data[i] instead of data["i"]

    for label in top_10:
        data[label] = np.where(data[i]==label, 1, 0)

But you should probably use a better method to name the new columns, so they can't collide with existing ones

    for number, value in enumerate(top_10):
        data[i + '_' + str(number)] = np.where(data[i]==value, 1, 0)

It could be more readable with different names

for column_name in data.columns:
    
    top_10 = data[column_name].value_counts().sort_values(ascending=False).head(10).index

    for number, value in enumerate(top_10):
        data[column_name + '_' + str(number)] = np.where(data[column_name]==value, 1, 0)

But without some example data it is hard to say whether this does what you want.

EDIT:

Minimal working example.

I use random.seed(0) to always get the same values.

I use top_3 instead of top_10 so that all values fit on screen.

import pandas as pd
import random
import numpy as np

random.seed(0) #  to get the same values every time

data = pd.DataFrame({
    "A": [random.randint(0, 10) for _ in range(10)],
    "B": [random.randint(0, 10) for _ in range(10)],
})

#print(data)

for column_name in data.columns:
    #print(data[column_name].value_counts())
    top_3 = data[column_name].value_counts().sort_values(ascending=False).head(3).index
    #print(top_3)
    for number, value in enumerate(top_3, 1):
        name = column_name + '_' + str(number)
        data[name] = np.where(data[column_name]==value, 1, 0)
        
print(data)

Result:

   A  B  A_1  A_2  A_3  B_1  B_2  B_3
0  6  9    1    0    0    0    0    0
1  6  3    1    0    0    0    0    0
2  0  8    0    0    0    0    0    1
3  4  2    0    1    0    1    0    0
4  8  4    0    0    0    0    1    0
5  7  2    0    0    1    1    0    0
6  6  1    1    0    0    0    0    0
7  4  9    0    1    0    0    0    0
8  7  4    0    0    1    0    1    0
9  5  8    0    0    0    0    0    1
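
A possible variation, only as a sketch: wrap the loop in a function and name the new columns after the value itself instead of its rank. The function name add_top_n_indicators and the value-based column names are my own choices, not something from the question. It relies on value_counts() already returning counts in descending order, so the extra sort_values() is not needed. Note that the order among values with equal counts may vary, so which of the tied values land in the top n is not guaranteed.

```python
import numpy as np
import pandas as pd


def add_top_n_indicators(df, n=3):
    """Return a copy of df with a 0/1 column for each column's n most frequent values."""
    out = df.copy()
    # Iterate over the original columns only, so newly added columns are never re-encoded.
    for column_name in df.columns:
        # value_counts() sorts by count, most frequent first.
        top_values = df[column_name].value_counts().head(n).index
        for value in top_values:
            name = f"{column_name}_{value}"   # e.g. "A_6" for value 6 in column "A"
            out[name] = np.where(df[column_name] == value, 1, 0)
    return out


# Same values as column A in the example above; 6 is the only value occurring three times.
data = pd.DataFrame({"A": [6, 6, 0, 4, 8, 7, 6, 4, 7, 5]})
result = add_top_n_indicators(data, n=1)
print(result)
```

Because the function works on a copy, the original data is left unchanged and you can compare both frames.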