Search code examples
pythonhashmap

Rolling mean with conditions (interview problem)


I encountered this question during an interview and can't think of a solution. This is the problem, suppose you had a dataset as follows (it goes beyond time 2 but this is just a sample to work with):

import pandas as pd

data = pd.DataFrame({
    'time': [1, 1, 1, 2, 2, 2],
    'names': ["Andy", "Bob", "Karen", "Andy", "Matt", "Sim"],
    'val': [1, 2, 3, 5, 6, 8]
})

Write a function to calculate the mean of values up till each time point but don't count duplicate names. That is, for time 1 the mean is (1+2+3)/3, for time 2 I don't include Andy's first value of '1' I only include the most recent value so the mean for time 2 is (2+3+5+6+8)/5.

I have tried creating two dictionaries, one that stores the 'time' count and the other keeping track of 'names' and 'values' but I don't know how to proceed from there or how to come up with an efficient solution so I am not recalculating means at each step (this was another requirement for the interview).

It doesn't have to be a pandas solution, the data form can be anything you prefer. I just presented it as a pandas df.


Solution

  • IIUC, you want to compute the mean of the values up to the current time, while considering only the last seen duplicates (if any). If so, here is one potential option that uses boolean indexing inside a for-loop to build the expanding windows :

    # uncomment if necessary
    # data.sort_values("time", inplace=True)
    
    to_keep = "last" # duplicate
    
    means = {}
    for t in data["time"].unique():
        window = data.loc[data["time"].le(t)]
        m = ~window["names"].duplicated(to_keep)
        means[t] = window.loc[m, "val"].mean()
    

    Output (means) :

    { # time|mean
        1: 2.0, #  (1+2+3)/3
        2: 4.8, #  (2+3+5+6+8)/5
    }