Suppose I have an input dataset (a list of lists) with an id, a date, and a score value, and I would like to filter it down to the highest-scoring day for each id. Normally in SQL I would do this with a window and rank function, but I can't think of a Pythonic way of approaching this.
Here is a naive solution:
data = [
    ["123", "11/11/11", "0.5"],
    ["555", "12/11/11", "0.3"],
    ["555", "13/11/11", "0.9"],
    ["123", "14/11/11", "0.8"],
]

# Sort by id, then score, both descending, so each id's best record comes first.
_sorted = sorted(data, key=lambda record: (record[0], record[2]), reverse=True)

output = []
last_id_seen = None
for record in _sorted:
    # Keep only the first (highest-scoring) record seen for each id.
    if record[0] != last_id_seen:
        output.append(record)
        last_id_seen = record[0]

# output
# [['555', '13/11/11', '0.9'], ['123', '14/11/11', '0.8']]
But this feels clumsy, and I don't know how well the sort will handle a more complex situation. Ideally I'd also like to avoid a Pandas or NumPy solution, as I don't think they are needed here.
from itertools import groupby

data = [
    ["123", "11/11/11", "0.5"],
    ["555", "12/11/11", "0.3"],
    ["555", "13/11/11", "0.9"],
    ["123", "14/11/11", "0.8"],
]

# Sort by id, then score, both descending.
# float() makes the scores compare numerically rather than as strings.
_sorted = sorted(data, key=lambda record: (record[0], float(record[2])), reverse=True)

# groupby yields (id, iterator over that id's consecutive records).
for k, v in groupby(_sorted, key=lambda x: x[0]):
    print(next(v))  # the first record in each group has the highest score
I used groupby from itertools to group the sorted list on the id column. Since the sort is in descending order of score, taking the first element of each group is enough.
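If the sort itself is the part that feels clumsy, a single pass with a dictionary might also work. This is only a sketch, assuming the id is the first field and the score is the third field and always parses as a float; `best_per_id` is a hypothetical helper name, not something from the standard library.

```python
def best_per_id(records):
    """Return the highest-scoring record for each id, in first-seen id order."""
    best = {}
    for record in records:
        record_id, score = record[0], float(record[2])
        # Keep the record if this id is new or its score beats the current best.
        if record_id not in best or score > float(best[record_id][2]):
            best[record_id] = record
    return list(best.values())


data = [
    ["123", "11/11/11", "0.5"],
    ["555", "12/11/11", "0.3"],
    ["555", "13/11/11", "0.9"],
    ["123", "14/11/11", "0.8"],
]

print(best_per_id(data))
# [['123', '14/11/11', '0.8'], ['555', '13/11/11', '0.9']]
```

This avoids sorting altogether, so it runs in one pass over the data. On ties it keeps the first record seen, and the result follows the order in which ids first appear (dictionaries preserve insertion order in Python 3.7+).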