for i in data.columns:
    top_10 = [x for x in data.i.value_counts().sort_values(ascending=False).head(10).index]
    for label in top_10:
        data[label] = np.where(data['i'] == label, 1, 0)
data[['i'] + top_10]
What is the mistake?
If you want to use the variable i that you have in for i in data.columns: then you shouldn't use data.i but data[i] (without the quotes ' ').
for i in data.columns:
    top_10 = data[i].value_counts().sort_values(ascending=False).head(10).index
It might be more readable with a more descriptive name, e.g. column_name:
for column_name in data.columns:
    top_10 = data[column_name].value_counts().sort_values(ascending=False).head(10).index
data.i is equivalent to data["i"]: it means the column literally named i, not the variable i.
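To make the difference visible, here is a minimal sketch. The DataFrame and its column names are made up for illustration (they are not from the question); it only assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data: one column is literally named "i", the other "A".
data = pd.DataFrame({"i": [1, 2], "A": [3, 4]})

i = "A"  # a variable that holds a column name

by_attribute = data.i    # the column literally named "i"
by_string = data["i"]    # the same column, selected with a string literal
by_variable = data[i]    # the column whose name is stored in i, here "A"

print(by_attribute.tolist())
print(by_string.tolist())
print(by_variable.tolist())
```

If the DataFrame has no column named i at all, data.i raises AttributeError and data["i"] raises KeyError, which is probably what happens with the code in the question.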
I don't know what you are trying to do with the nested for-loop, but there you should also use data[i] instead of data["i"]:
for label in top_10:
    data[label] = np.where(data[i] == label, 1, 0)
But you should probably use a better method to create the labels for the new columns, for example the column name plus a number:
for number, value in enumerate(top_10):
    # compare the column's values (not its index) with the top value
    data[i + '_' + str(number)] = np.where(data[i] == value, 1, 0)
It could be more readable with different names:
for column_name in data.columns:
    top_10 = data[column_name].value_counts().sort_values(ascending=False).head(10).index
    for number, value in enumerate(top_10):
        # compare the column's values (not its index) with the top value
        data[column_name + '_' + str(number)] = np.where(data[column_name] == value, 1, 0)
But without some example data it is hard to say if it is correct.
EDIT:
Minimal working example. I use random.seed(0) to always get the same values, and top_3 instead of top_10 so that all values fit on screen.
import pandas as pd
import random
import numpy as np

random.seed(0)  # to get the same values every time

data = pd.DataFrame({
    "A": [random.randint(0, 10) for _ in range(10)],
    "B": [random.randint(0, 10) for _ in range(10)],
})
#print(data)

for column_name in data.columns:
    #print(data[column_name].value_counts())
    top_3 = data[column_name].value_counts().sort_values(ascending=False).head(3).index
    #print(top_3)
    for number, value in enumerate(top_3, 1):
        name = column_name + '_' + str(number)
        data[name] = np.where(data[column_name] == value, 1, 0)

print(data)
Result:
   A  B  A_1  A_2  A_3  B_1  B_2  B_3
0  6  9    1    0    0    0    0    0
1  6  3    1    0    0    0    0    0
2  0  8    0    0    0    0    0    1
3  4  2    0    1    0    1    0    0
4  8  4    0    0    0    0    1    0
5  7  2    0    0    1    1    0    0
6  6  1    1    0    0    0    0    0
7  4  9    0    1    0    0    0    0
8  7  4    0    0    1    0    1    0
9  5  8    0    0    0    0    0    1
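If you need this more than once, the same loop could be wrapped in a small function. This is only a sketch: the name add_top_n_indicators and the parameter n are my own invention, and it relies on value_counts() already sorting by count in descending order by default, so the extra sort_values() call is left out. When several values have the same count, which of them ends up in the top n may differ from the result above.

```python
import numpy as np
import pandas as pd


def add_top_n_indicators(data, n=3):
    """Return a copy of data with 0/1 columns for the n most frequent values of every original column."""
    result = data.copy()
    # iterate over a snapshot of the original column names,
    # so the newly added columns are never processed themselves
    for column_name in list(data.columns):
        # value_counts() sorts by count, most frequent first
        top_n = data[column_name].value_counts().head(n).index
        for number, value in enumerate(top_n, 1):
            name = f"{column_name}_{number}"
            result[name] = np.where(data[column_name] == value, 1, 0)
    return result


# made-up example data without ties among the two most frequent values
example = pd.DataFrame({"A": [6, 6, 6, 4, 4, 8]})
print(add_top_n_indicators(example, n=2))
```

Returning a copy keeps the original DataFrame unchanged, which makes it easier to rerun the code with a different n.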