Search code examples
pythonranking

How to replicated SQL window rank in python


If we assume i have a input dataset (list of lists) with id, data, and a score value and I would like to filter down to the highest scoring day for each id. Normally in SQL I would do this with a window and rank function but i can't think of a Pythonic way of approaching this.

Here is a native solution:

data = [
    ["123", "11/11/11", "0.5"],
    ["555", "12/11/11", "0.3"],
    ["555", "13/11/11", "0.9"],
    ["123", "14/11/11", "0.8"]
]
_sorted = sorted( data, key=lambda record: (record[0], record[2]), reverse=True)

output = []

last_id_seen = None
for record in _sorted:
    if record[0] is last_id_seen:
        continue
    last_id_seen = record[0]
    output.append(record)
print(output)

# output
# [['555', '13/11/11', '0.9'], ['123', '14/11/11', '0.8']]

But this feels clumsy and I don't know how well the sort will support a more complex situation. Also I'd ideally like to avoid a Pandas or Numpy solution as i dont think they are needed here.

Suggestions?


Solution

  • data = [
        ["123", "11/11/11", "0.5"],
        ["555", "12/11/11", "0.3"],
        ["555", "13/11/11", "0.9"],
        ["123", "14/11/11", "0.8"]
    ] # data 
    
    from itertools import groupby # groupby function
    # Sort on id and score
    _sorted = sorted( data, key=lambda record: (record[0], record[2]), reverse=True)
    
    for k, v in groupby(_sorted, lambda x: x[0]): # group by id
        # k: ids, v: groups
        print(list(v)[0]) # print
    

    I have used groupby from itertools to group sorted array on ID column. Since we have a reverse order on score key, getting the first element v[0] of each group is enough.