Search code examples
pythonpandasstandard-deviationcumsum

Among investment histories how to find the cumulative sum option that proves to be more reliable for the long term?


In fact, not always the chart that reaches the highest peak of positive value when doing cumulative sum is the most reliable for long-term investments, because a single investment may have generated a very high profit but then it returns to the normal of being negative and if become an endless fall.

Also relying on higher ROI (return on investment) is risky for the same reasons as above.

That said, the cumulative sum graphs generated by these test values are:

ex_csv_1 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
"""

enter image description here

ex_csv_2 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,-2
a,b,c,-3
a,b,c,4
a,b,c,5
a,b,c,6
a,b,c,7
"""

enter image description here

ex_csv_3 = """
Col 1,Col 2,Col 3,return
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,-2
a,b,c,2
"""

enter image description here

If I wanted to find the one with the biggest peak, I would do it this way:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import io

ex_csv_1 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
a,b,c,1
a,b,c,-1
a,b,c,1
"""

ex_csv_2 = """
Col 1,Col 2,Col 3,return
a,b,c,1
a,b,c,-2
a,b,c,-3
a,b,c,4
a,b,c,5
a,b,c,6
a,b,c,7
"""

ex_csv_3 = """
Col 1,Col 2,Col 3,return
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,2
a,b,c,-2
a,b,c,2
"""

def save_fig(cs):
    values = np.cumsum(cs[2])
    fig = plt.figure()
    plt.plot(values)
    fig.savefig(f'a_graph.png', dpi=fig.dpi)
    fig.clf()
    plt.close('all')

options = []

for i,strio in enumerate([ex_csv_1,ex_csv_2,ex_csv_3]):
    df = pd.read_csv(io.StringIO(strio), sep=",")
    df['invest'] = df.groupby(['Col 1','Col 2','Col 3'])['return'].cumsum().gt(df['return'])
    pl = df[(df['invest'] == True)]['return']
    total_sum = pl.sum()
    roi = total_sum/len(pl)
    options.append([total_sum,roi,pl])
max_list = max(options, key=lambda sublist: sublist[0])
save_fig(max_list)

But how should I go about finding which track record among the three demonstrates keeping the smallest fluctuation and delivering the greatest long-term reliability?

I will put two charts below, the second chart that has less oscillations is the most reliable among them for the long term, as the variations are smaller and maintains a crescent with an established pattern:

enter image description here enter image description here


Solution

  • One simple measure of "reliability" in a graph is how well the graph matches with linear behavior. To calculate this, we can perform a linear regression on the data. The scipy.stats package has a nice built-in function for this. A "good" result should have a high R-value, meaning the data are behaving linearly. Also, the slope of the result should be positive, meaning it increases over time.

    results = {}
    for i,strio in enumerate([ex_csv_1,ex_csv_2,ex_csv_3]):
        df = pd.read_csv(io.StringIO(strio), sep=",")
        df['cumsum'] = df.groupby(['Col 1','Col 2','Col 3'])['return'].cumsum()
    
        # Perform the linear regression
        linreg = scipy.stats.linregress(df.index,df['cumsum'])
    
        # Save the results for comparison later
        results[i] = linreg
    
        # Plot to see how the regression matches the data
        plt.plot(df.index, df['cumsum'])
        xmin, xmax = min(df.index), max(df.index)
        plt.plot(
            [xmin, xmax], 
            [xmin*linreg.slope + linreg.intercept, xmax*linreg.slope + linreg.intercept],
            label = "slope: {:g}\nR-value:{:g}".format(linreg.slope, linreg.rvalue)
        )
        plt.legend()
        plt.show()
    results
    

    The output results are:

    {0: LinregressResult(slope=0.2857142857142857, intercept=1.1428571428571428, rvalue=0.7559289460184545, pvalue=0.04931308767365261, stderr=0.11065666703449761),
     1: LinregressResult(slope=3.0, intercept=-4.714285714285714, rvalue=0.8373248339703451, pvalue=0.01874218974109145, stderr=0.8759834123860507),
     2: LinregressResult(slope=1.2857142857142856, intercept=3.0, rvalue=0.9185586535436918, pvalue=0.0034781651152865026, stderr=0.24743582965269673)}
    

    I would interpret this as:

    1. Low R-Value: Plot one has a lot of variability. Low slope: bad return on investment
    2. OK R-Value: Plot is more consistent. High slope: good return on investment
    3. High R-Value: Plot is very consistent. High slope: good return on investment