I have a three columns dataframe as follows. I want to calculate the returns in three months per day for every funds, so I need to get the date with recorded NAV data three months ago. Should I use the max() function with filter() function to deal this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code | date | NAV |
---|---|---|
fund 1 | 2021-01-04 | 1.0000 |
fund 1 | 2021-01-05 | 1.0001 |
fund 1 | 2021-01-06 | 1.0023 |
... | ... | ... |
fund 2 | 2020-02-08 | 1.0000 |
fund 2 | 2020-02-09 | 0.9998 |
fund 2 | 2020-02-10 | 1.0001 |
... | ... | ... |
fund 3 | 2022-05-04 | 2.0021 |
fund 3 | 2022-05-05 | 2.0044 |
fund 3 | 2022-05-06 | 2.0305 |
I tried to combined the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with python, I don't know what I could do. I just learnt it three days ago. Please help me.
This picture is what I want if it was in excel. The yellow area is the original data. The white part is the procedure I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use pct_change(period=7) function to get the same results in this picture. But here is the tricky part: the line 7 rows before is not necessarily the data 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check if the data used for division exists first.
what you need is an implementation of the maximum in sliding window (for your example 1 week, 7days).
I could recreated you example as follow (to create the data frame you have):
import pandas as pd
import datetime
from random import randint
df = pd.DataFrame(columns=["fund code", "date", "NAV"])
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
for i in range(10):
df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV":randint(0,10)}, ignore_index=True)
for i in range(20, 25):
df = df.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV":randint(0,10)}, ignore_index=True)
for i in range(20, 25):
df = df.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV":randint(0,10)}, ignore_index=True)
this will look like your example, with not continuous dates and two different funds.
The maximum sliding window (for variable days length look like this)
import queue
class max_queue:
def __init__(self, win=7):
self.win = win
self.queue = queue.deque()
self.date = None
def append(self, date, value):
while self.queue and value > self.queue[-1][1]:
self.queue.pop()
while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
self.queue.popleft()
self.queue.append((date, value))
self.date = date
def get_max(self):
return self.queue[0][1]
now you could simply iterate over rows and get the max value in the timeframe you are interested.
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
code, date, nav,*_ = row
if code != pre_code:
mq = max_queue(7)
pre_code = code
mq.append(date, nav)
df.at[idx, 'max'] = mq.get_max()
results will look like this, with added max column. This assumes that funds data are continuous, but you could as well modify to have seperate max_queue for each funds as well.
using max queue to only keep track of the max in the window would be the correct complexity O(n) for a solution. important if you are dealing with huge datasets and especially bigger date ranges (instead of week).