Given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the N columns with the highest values and save them as rows in a new data frame.
For example, consider the following data frame:
import pandas as pd
df = pd.DataFrame()
df['index'] = ['A', 'B', 'C', 'D','E', 'F']
df['option1'] = [1,5,3,7,9,3]
df['option2'] = [8,4,5,6,9,2]
df['option3'] = [9,9,1,3,9,5]
df['option4'] = [3,8,3,5,7,0]
df['option5'] = [2,3,4,9,4,2]
I'd like to output (let's say N is 3, so I want the top 3):
A,option3
A,option2
A,option4
B,option3
B,option4
B,option1
C,option2
C,option5
C,option4 (or option1; ties aren't really a problem)
D,option5
D,option1
D,option2
and so on.
Any idea how this can be achieved easily? Thanks.
If you just want pairings:
from operator import itemgetter as it
from itertools import repeat
n = 3
# On pandas < 0.17, use order() instead of sort_values().
# Series.iteritems() was removed in pandas 2.0, so items() is used here.
new_d = (zip(repeat(row["index"]), map(it(0), row[1:].sort_values(ascending=False)[:n].items()))
         for _, row in df.iterrows())
for row in new_d:
    print(list(row))
Output:
[('A', 'option3'), ('A', 'option2'), ('A', 'option4')]
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')]
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')]
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')]
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')]
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]
This also keeps the columns in descending order of value within each row.
If you want a list of lists:
from operator import itemgetter as it
from itertools import repeat
n = 3
new_d = [list(zip(repeat(row["index"]), map(it(0), row[1:].sort_values(ascending=False)[:n].items())))
         for _, row in df.iterrows()]
Output:
[[('A', 'option3'), ('A', 'option2'), ('A', 'option4')],
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')],
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')],
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')],
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')],
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]]
Or using Python's built-in sorted:
new_d = [list(zip(repeat(row["index"]), map(it(0), sorted(row[1:].items(), key=it(1), reverse=True)[:n])))
         for _, row in df.iterrows()]
This is actually the fastest of the three. If you really want strings, it is trivial to format the output however you want.
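To illustrate the formatting step, here is a minimal sketch that flattens the pairs into the "A,option3" lines from the question and also into a new data frame. The helper name top_n_pairs and the output column name "option" are made up for this example, and the sketch uses the sorted approach, whose stable sort keeps tied columns in their left-to-right order.

```python
import pandas as pd

df = pd.DataFrame({
    "index": ["A", "B", "C", "D", "E", "F"],
    "option1": [1, 5, 3, 7, 9, 3],
    "option2": [8, 4, 5, 6, 9, 2],
    "option3": [9, 9, 1, 3, 9, 5],
    "option4": [3, 8, 3, 5, 7, 0],
    "option5": [2, 3, 4, 9, 4, 2],
})
n = 3


def top_n_pairs(frame, n, label_col="index"):
    """Hypothetical helper: (label, column) pairs for the n largest values per row."""
    pairs = []
    for _, row in frame.iterrows():
        values = row.drop(label_col)  # keep only the numeric columns
        # sorted() is stable, so tied columns keep their original order
        ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)[:n]
        pairs.extend((row[label_col], col) for col, _ in ranked)
    return pairs


pairs = top_n_pairs(df, n)

# One row per (label, column) pair; "option" is an assumed column name.
result = pd.DataFrame(pairs, columns=["index", "option"])

# Plain "label,column" strings, as in the question.
lines = ["{},{}".format(label, col) for label, col in pairs]
print("\n".join(lines))
```

If the result needs to go to a file, result.to_csv("out.csv", index=False, header=False) should give the same comma-separated lines; the file name here is only a placeholder.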