Tags: python, percentage, finance, stock

How do I calculate a value from the current row and all past rows, one row at a time, in a pandas DataFrame?


My calculation is supposed to produce a number between 0 and 100 from financial stock data. Each day this number starts at 0, and with each new tick/row it is recalculated from all the data available so far for that day. So, as you would expect, when used live this number changes constantly with each tick. The problem I am ultimately trying to solve is putting this calculation into a dataframe as a new column. What I can't figure out is how to calculate one row at a time, including all past rows, in a dataframe.

For example, I would like to start by calculating just rows 1 and 2. The result will be in its own column on row 2. The result on row 3 would be a calculation over rows 1, 2, and 3. Then the same thing for row 4, and so forth.

My calculation:


ranging=df.h.max()-df.l.min()
percent=round(100*((df.c-df.l.min())/ranging),2)

I asked ChatGPT to solve this problem before coming here. Here is what it gave me, without using my calculation:

df = pd.DataFrame(columns=['data','anchored_avg'])
# define your data points
data_points = [1,2,3,4,5,6,7,8,9,10]
anchored_avg = data_points[0]
for i, current_data_point in enumerate(data_points):
    df = df.append({'data': current_data_point}, ignore_index=True)
    if i > 0:
        anchored_avg = (anchored_avg*i + current_data_point)/(i+1)
        df.loc[i,'anchored_avg'] = anchored_avg

I asked it to do this with my own calculation, but the code didn't work. Here is what I tried with my own calculation (df is OHLC (Open, High, Low, Close) stock data):

for i, current_data_point in enumerate(df[['o','h','l','c']]):
    df= df.append(current_data_point, ignore_index=True)
    if i > 0:
        ranging=df.h.max()-df.l.min()
        percent=round(100*((df.c-df.l.min())/ranging),2)
        df.loc[i,'percent'] = percent

The error I get from this is: "cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid"

Here is what I did on my own, but it does not give the result I want (mask_set holds the masks that select each day I want to run the calculation on):

mask_set = [mask_1,mask_2,mask_3,mask_4,mask_5,mask_6,mask_7,mask_8,mask_9,mask_10]
percent_set = []
for i in mask_set:
    ranging=df[i].h.max()-df[i].l.min()
    percent=round(100*((df[i].c-df[i].l.min())/ranging),2)
    percent_set.append(percent)
percent_comb = pd.concat(percent_set)
df['Symb_perc'] = percent_comb

Output sample:

time                 o        h        l        c        volume  complete  Symb_perc
2023-01-13 12:00:00  0.63638  0.63638  0.63635  0.63636  9       True      35.40
2023-01-13 12:00:05  0.63634  0.63636  0.63634  0.63636  11      True      35.40
2023-01-13 12:00:10  0.63639  0.63644  0.63638  0.63644  6       True      36.40
2023-01-13 12:00:15  0.63644  0.63651  0.63644  0.63649  12      True      37.02
2023-01-13 12:00:20  0.63650  0.63654  0.63650  0.63654  5       True      37.64
2023-01-13 12:00:25  0.63654  0.63659  0.63654  0.63659  8       True      38.26
2023-01-13 12:00:30  0.63659  0.63659  0.63652  0.63656  10      True      37.89
2023-01-13 12:00:35  0.63658  0.63658  0.63657  0.63658  5       True      38.14
2023-01-13 12:00:40  0.63658  0.63667  0.63658  0.63667  10      True      39.25
2023-01-13 12:00:45  0.63668  0.63668  0.63666  0.63667  5       True      39.25

I know this calculation is wrong because when I use it with a live feed, the numbers are off by a large margin.


Solution

  • I think what you are looking for is pandas' family of cumulative functions. Here is the problem you want to solve, if I read it correctly:

    For each row in the dataframe, compute the percentage as: the value in column C minus the minimum of column L from the first row up to that row, divided by (the maximum of column H from the first row up to that row, minus that same minimum of column L).
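
The definition above can be written out literally as a loop over growing prefixes of the dataframe, which matches the "rows 1 and 2, then rows 1, 2 and 3" picture in the question. This is only a minimal sketch for checking the logic: it uses the first four rows of the sample data, the column name percent_loop is made up for illustration, and the loop re-scans every earlier row, so it gets slow on long tick histories.

```python
import pandas as pd

# First four rows of the question's sample data (H, L, C only).
df = pd.DataFrame({
    "H": [0.63638, 0.63636, 0.63644, 0.63651],
    "L": [0.63635, 0.63634, 0.63638, 0.63644],
    "C": [0.63636, 0.63636, 0.63644, 0.63649],
})

percent = []
for i in range(len(df)):
    seen = df.iloc[: i + 1]      # rows from the start up to and including row i
    low = seen["L"].min()        # lowest low so far
    high = seen["H"].max()       # highest high so far
    percent.append(round(100 * (df["C"].iloc[i] - low) / (high - low), 2))

df["percent_loop"] = percent     # hypothetical column name
print(df)
```

Each pass recomputes the minimum and maximum from scratch, which is exactly the work that a running minimum and running maximum avoid.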

    Expressed with cumulative functions:

    For each row in the dataframe, compute the percentage as: the value in column C minus the cumulative min of L, divided by (the cumulative max of H minus the cumulative min of L).

    We can then use the cummin() and cummax() functions. The full code is below:

    from io import StringIO

    import pandas as pd

    # Sample ticks from the question: time, open, high, low, close
    TESTDATA = StringIO("""12:00:00 0.63638 0.63638 0.63635 0.63636
    12:00:05 0.63634 0.63636 0.63634 0.63636
    12:00:10 0.63639 0.63644 0.63638 0.63644
    12:00:15 0.63644 0.63651 0.63644 0.63649
    12:00:20 0.63650 0.63654 0.63650 0.63654
    12:00:25 0.63654 0.63659 0.63654 0.63659
    12:00:30 0.63659 0.63659 0.63652 0.63656
    12:00:35 0.63658 0.63658 0.63657 0.63658
    12:00:40 0.63658 0.63667 0.63658 0.63667
    12:00:45 0.63668 0.63668 0.63666 0.63667
    """)

    # Use the time column as the index
    df = pd.read_csv(TESTDATA, sep=" ", names=["time", "O", "H", "L", "C"], index_col="time")

    low_so_far = df["L"].cummin()    # running minimum of the lows
    high_so_far = df["H"].cummax()   # running maximum of the highs
    df["percentage"] = (100 * (df["C"] - low_so_far) / (high_so_far - low_so_far)).round(2)
    print(df)
    

    It prints (time index omitted):

    O       H       L       C       percentage
    0.63638 0.63638 0.63635 0.63636 33.33
    0.63634 0.63636 0.63634 0.63636 50.00
    0.63639 0.63644 0.63638 0.63644 100.00
    0.63644 0.63651 0.63644 0.63649 88.24
    0.63650 0.63654 0.63650 0.63654 100.00
    0.63654 0.63659 0.63654 0.63659 100.00
    0.63659 0.63659 0.63652 0.63656 88.00
    0.63658 0.63658 0.63657 0.63658 96.00
    0.63658 0.63667 0.63658 0.63667 100.00
    0.63668 0.63668 0.63666 0.63667 97.06
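
The question also says the number starts over each day. One way to get that, sketched here under assumptions, is to take the same running minimum and maximum within each calendar day by grouping on the date. The sketch assumes the dataframe has a DatetimeIndex and lower-case h/l/c columns, as in the question's output sample. The first three rows (2023-01-12) are invented purely to show the reset, and the last three are the first rows of the question's sample.

```python
import pandas as pd

# Hypothetical two-day sample: the 2023-01-12 rows are made up for illustration,
# the 2023-01-13 rows are the first three rows of the question's sample.
idx = pd.to_datetime([
    "2023-01-12 12:00:00", "2023-01-12 12:00:05", "2023-01-12 12:00:10",
    "2023-01-13 12:00:00", "2023-01-13 12:00:05", "2023-01-13 12:00:10",
])
df = pd.DataFrame(
    {
        "h": [0.63700, 0.63710, 0.63720, 0.63638, 0.63636, 0.63644],
        "l": [0.63690, 0.63695, 0.63700, 0.63635, 0.63634, 0.63638],
        "c": [0.63695, 0.63700, 0.63720, 0.63636, 0.63636, 0.63644],
    },
    index=idx,
)

day = df.index.normalize()                  # midnight of each row's calendar day
run_low = df.groupby(day)["l"].cummin()     # lowest low so far within the day
run_high = df.groupby(day)["h"].cummax()    # highest high so far within the day

# Note: if the running high equals the running low (possible on a day's first
# bar), the range is 0 and the result is NaN or inf rather than a percentage.
df["percent"] = (100 * (df["c"] - run_low) / (run_high - run_low)).round(2)
print(df)
```

With this grouping, the first row of 2023-01-13 is computed from that row alone rather than inheriting the previous day's high and low, so no per-day masks or concatenation are needed.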