python pandas loops date irr

Need help calculating IRR for different investments at different dates for each investment, in python


I need help calculating IRR for different investments, and the IRR for those investments at different times.

So I have one dataframe that looks like this:

DATE Investment Flow
2012-05-12 1 -50
2013-09-04 1 100
2014-05-05 1 300
2013-09-04 2 -700
2015-05-12 2 1000
2012-04-04 3 100
2013-05-12 3 -50
2013-09-04 4 -60

And another one that looks like this:

DATE Investment Stock
2012-09-05 1 400
2014-05-05 1 600
2014-05-05 2 300
2013-09-04 2 800
2012-09-14 3 1000
2013-09-05 4 6000

So I want to create multiple dataframes that contain the flow of each investment up until each date that I have information on the stock, with the last row containing the stock for that date. For example, I have 2 observations on the stock for investment 1, so I should create 2 dataframes for investment 1 that look like this:

DATE Investment Flow + Stock(last row)
2012-05-12 1 -50
2012-09-05 1 400
DATE Investment Flow + Stock(last row)
2012-05-12 1 -50
2013-09-04 1 100
2014-05-05 1 300
2014-05-05 1 600

And for investment 3, given that I only have one observation on stock, there should be only 1 dataframe, which looks like this:

DATE Investment Flow + Stock(last row)
2012-04-04 3 100
2012-09-14 3 1000

Given that I have a lot of data, manually creating each dataframe is cumbersome; I would also like the code to update the IRR when I have new information. I want to do this because I want to see the evolution of the IRR at each date that I have information on the stock for each investment, sort of like a time series of IRR for the investments. I will calculate the IRR from the dataframes created.

I have tried ranking the dates for which I have information on stock for each investment, but I'm having trouble with the loop.

Thank you very much

EDIT: Per request of Henry Ecker, this is a sample of the merged database.

         DATE_x       Investment         Flow     DATE_y         Stock
355  2018-08-29            1            1371300 2020-09-30    2904678,03
3076 2016-03-31            2           -4535569 2015-06-30             0
1564 2017-11-28            3            1142227 2014-09-30   10378007,31
3666 2018-02-22            2            1622857 2020-03-31  122203846,09
1394 2017-05-16            3            3116642 2017-12-31             0
472  2013-11-09            3           -4364500 2015-12-31   45789217,93
446  2021-02-23            1             325117 2020-03-31   13176648,97
1641 2018-01-31            3             623695 2015-09-30             0
1297 2017-03-21            3            1146193 2015-09-30    32103654,6
2080 2020-09-15            3             461123 2017-09-30   47763628,79

Solution

  • One way to go about this is to join flows and observations to get the associations, then group by observation date and Investment id to get each group that we're interested in.

    The function process_df is used to keep only the flows that occur on or before the observation date (DATE_y).

    Grab the Investment, observation date (DATE_y) and Stock values from the first row of the group (they are the same for every row in the group) and append them as the last row of the table. Then clean everything up: drop the extra columns (Stock and DATE_y), reset the index, and rename the columns to reflect the desired output.

    import pandas as pd
    
    flows = pd.DataFrame({'DATE': {0: '2012-05-12', 1: '2013-09-04',
                                   2: '2014-05-05', 3: '2013-09-04',
                                   4: '2015-05-12', 5: '2012-04-04',
                                   6: '2013-05-12', 7: '2013-09-04',
                                   8: '2020-05-12', 9: '2016-07-12'},
                          'Investment': {0: 1, 1: 1, 2: 1, 3: 2,
                                         4: 2, 5: 3, 6: 3, 7: 4,
                                         8: 5, 9: 7},
                          'Flow': {0: -50, 1: 100, 2: 300, 3: -700,
                                   4: 1000, 5: 100, 6: -50, 7: -60,
                                   8: 100, 9: 800}})
    flows['DATE'] = flows['DATE'].astype('datetime64[ns]')
    
    observations = pd.DataFrame({'DATE': {0: '2012-09-05', 1: '2014-05-05',
                                          2: '2014-05-05', 3: '2013-09-04',
                                          4: '2012-09-14', 5: '2013-09-05',
                                          6: '2014-05-14', 7: '2015-12-14'},
                                 'Investment': {0: 1, 1: 1, 2: 2,
                                                3: 2, 4: 3, 5: 4,
                                                6: 5, 7: 6},
                                 'Stock': {0: 400, 1: 600, 2: 300,
                                           3: 800, 4: 1000, 5: 6000,
                                           6: 0, 7: 15}})
    observations['DATE'] = observations['DATE'].astype('datetime64[ns]')
    
    
    def process_df(df):
        out = df[df['DATE_x'] <= df['DATE_y']]  # Keep flows on or before the observation date
        if out.empty:
            # Handle the case where there is an observation but no flows
            return df[['DATE_y', 'Investment', 'Stock']] \
                .reset_index(drop=True) \
                .rename(columns={'DATE_y': 'DATE', 'Stock': 'Flow + Stock(last row)'})
        # Build the stock row from the first row of the group (DATE_y and Stock
        # are constant within a group) so it can be appended as the last row.
        # pd.concat is used because DataFrame.append was removed in pandas 2.0.
        stock_row = out[['DATE_y', 'Investment', 'Stock']] \
            .iloc[[0]] \
            .rename(columns={'DATE_y': 'DATE_x', 'Stock': 'Flow'})
        return pd.concat([out.drop(['DATE_y', 'Stock'], axis=1), stock_row],
                         ignore_index=True) \
            .rename(columns={'DATE_x': 'DATE', 'Flow': 'Flow + Stock(last row)'})
    
    
    merged = pd.merge(flows, observations, on='Investment', how='right')
    
    dfs = [process_df(group) for _, group in merged.groupby(['Investment', 'DATE_y'])]
    
    # For Display
    for i, new_df in enumerate(dfs):
        print(f'DataFrame {i+1}')
        print(new_df)
        print()
    

    dfs is a list containing the individual DataFrames.

    Output:

    DataFrame 1
            DATE  Investment  Flow + Stock(last row)
    0 2012-05-12           1                     -50
    1 2012-09-05           1                     400
    
    DataFrame 2
            DATE  Investment  Flow + Stock(last row)
    0 2012-05-12           1                     -50
    1 2013-09-04           1                     100
    2 2014-05-05           1                     300
    3 2014-05-05           1                     600
    
    DataFrame 3
            DATE  Investment  Flow + Stock(last row)
    0 2013-09-04           2                    -700
    1 2013-09-04           2                     800
    
    DataFrame 4
            DATE  Investment  Flow + Stock(last row)
    0 2013-09-04           2                    -700
    1 2014-05-05           2                     300
    
    DataFrame 5
            DATE  Investment  Flow + Stock(last row)
    0 2012-04-04           3                     100
    1 2012-09-14           3                    1000
    
    DataFrame 6
            DATE  Investment  Flow + Stock(last row)
    0 2013-09-04           4                     -60
    1 2013-09-05           4                    6000
    
    DataFrame 7
            DATE  Investment  Flow + Stock(last row)
    0 2014-05-14           5                       0
    
    DataFrame 8
            DATE  Investment  Flow + Stock(last row)
    0 2015-12-14           6                      15
    
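    Once dfs exists, the IRR step mentioned in the question can be sketched. Because the cash flows fall on irregular dates, a plain periodic IRR doesn't apply; one option is an XIRR-style root find on the dated NPV. The `xirr` helper below, its bisection bracket, and the Actual/365 day-count convention are assumptions for illustration, not part of the answer above; a library root finder (e.g. `scipy.optimize.brentq`) could replace the hand-rolled bisection.

    ```python
    from datetime import date

    def xirr(cashflows, lo=-0.9999, hi=1e6, tol=1e-9):
        """Find the rate r at which the dated NPV of cashflows is zero.

        cashflows: list of (date, amount) pairs; uses an Actual/365 day count
        with the earliest date as t = 0. (Illustrative helper, not a library API.)
        """
        t0 = min(d for d, _ in cashflows)

        def npv(rate):
            return sum(amt / (1.0 + rate) ** ((d - t0).days / 365.0)
                       for d, amt in cashflows)

        if npv(lo) * npv(hi) > 0:
            raise ValueError('NPV does not change sign on the bracket')
        # Plain bisection: halve the bracket until it is narrower than tol.
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if npv(lo) * npv(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2.0

    # Cash flows of the first generated frame for investment 1:
    # the -50 flow, then the 400 stock value as the terminal row.
    flows = [(date(2012, 5, 12), -50.0), (date(2012, 9, 5), 400.0)]
    rate = xirr(flows)
    ```

    Applying this to each frame in dfs (the flow rows plus the closing stock row) would give the per-observation-date IRR series the question asks for.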

    Edit Notes:

    • My initial implementation made the false assumption that all observations would have at least one flow. I made a guess as to how you'd like to receive the information about observations with no flow associations and chose to return a DataFrame that still has the last row with the stock information but no flows. If you'd prefer an empty DataFrame instead, just return out:
        if out.empty:
            # Handle the case where there is an observation but no flows
            return out
    
    
    • I added 3 additional test cases to the sample data:
      1. There is a flow, but it occurs after the observation
      2. There are no flows for a given investment id
      3. There is a flow but no observation.
        • Given that your generated DataFrames are based on observations, I've chosen to exclude flows that have no match in observations by using a 'right' join.
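    The effect of that 'right' join can be checked with merge's `indicator` flag. A minimal sketch with invented toy frames (the investment ids mirror the sample above, but the frames themselves are made up for illustration):

    ```python
    import pandas as pd

    # Toy frames: investment 7 has a flow but no observation,
    # investment 6 has an observation but no flow.
    flows = pd.DataFrame({'Investment': [1, 7], 'Flow': [-50, 800]})
    observations = pd.DataFrame({'Investment': [1, 6], 'Stock': [400, 15]})

    # how='right' keeps every observation row, so the unmatched flow
    # (investment 7) is dropped, while the unmatched observation
    # (investment 6) survives with NaN in the flow columns.
    merged = pd.merge(flows, observations, on='Investment', how='right',
                      indicator=True)
    ```

    The `_merge` column marks the investment-6 row as `right_only`, confirming that unmatched observations are kept while unmatched flows are dropped.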