Search code examples
pythonpandasdataframemax

How to find the maximum date value with conditions in python?


I have a three columns dataframe as follows. I want to calculate the returns in three months per day for every funds, so I need to get the date with recorded NAV data three months ago. Should I use the max() function with filter() function to deal this problem? If so, how? If not, could you please help me figure out a better way to do this?

fund code date NAV
fund 1 2021-01-04 1.0000
fund 1 2021-01-05 1.0001
fund 1 2021-01-06 1.0023
... ... ...
fund 2 2020-02-08 1.0000
fund 2 2020-02-09 0.9998
fund 2 2020-02-10 1.0001
... ... ...
fund 3 2022-05-04 2.0021
fund 3 2022-05-05 2.0044
fund 3 2022-05-06 2.0305

I tried to combined the max() function with filter() as follows:

max(filter(lambda x: x<=df['date']-timedelta(days=91))) But it didn't work.

Were this in excel, I know I could use the following functions to solve this problem: {max(if(B:B<=B2-91,B:B))}

{max(if(B:B<=B3-91,B:B))}

{max(if(B:B<=B4-91,B:B))}

....

But with python, I don't know what I could do. I just learnt it three days ago. Please help me.

This picture is what I want if it was in excel. The yellow area is the original data. The white part is the procedure I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.

This picture is what I want if it was in excel. The yellow area is the original data. The white part is the procedure I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.

I know that I could use pct_change(period=7) function to get the same results in this picture. But here is the tricky part: the line 7 rows before is not necessarily the data 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check if the data used for division exists first.


Solution

  • what you need is an implementation of the maximum in sliding window (for your example 1 week, 7days).

    I could recreated you example as follow (to create the data frame you have):

    import pandas as pd
    import datetime
    from random import randint
    df = pd.DataFrame(columns=["fund code", "date", "NAV"])
    date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
    for i in range(10):
        df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV":randint(0,10)}, ignore_index=True)
        
    for i in range(20, 25):
        df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV":randint(0,10)}, ignore_index=True)
        
    for i in range(20, 25):
        df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV":randint(0,10)}, ignore_index=True)
    

    this will look like your example, with not continuous dates and two different funds.

    The maximum sliding window (for variable days length look like this)

    import queue
    class max_queue:
        def __init__(self, win=7):
            self.win = win
            self.queue = queue.deque()
            self.date = None
        def append(self, date, value):
            
            while self.queue and value > self.queue[-1][1]:
                self.queue.pop()
                
            while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
                self.queue.popleft()
                
            self.queue.append((date, value))
            self.date = date
            
        def get_max(self):
            return self.queue[0][1]
        
    

    now you could simply iterate over rows and get the max value in the timeframe you are interested.

    mq = max_queue(7)
    pre_code = ''
    for idx, row in df.iterrows():
        code, date, nav,*_ = row
        if code != pre_code:
            mq = max_queue(7)
        pre_code = code
        mq.append(date, nav)
        df.at[idx, 'max'] = mq.get_max()
    

    results will look like this, with added max column. This assumes that funds data are continuous, but you could as well modify to have seperate max_queue for each funds as well.

    enter image description here

    using max queue to only keep track of the max in the window would be the correct complexity O(n) for a solution. important if you are dealing with huge datasets and especially bigger date ranges (instead of week).