Tags: python, pandas, sumifs

Reducing execution time for SUMIFS equivalent


I am trying to reproduce Excel's SUMIFS function, which would be approximately: accumulation1 = SUMIFS(value; $fin$1:$fin$5; ini$1)

What the formula does: it looks up and adds the values of the rows whose fin matches a given ini.

Example calculation for id 3 and accumulation1: find and add the values of the rows whose fin equals its ini (= 11), i.e. the values of id 1 and id 5, so (5 + 3) = 8.

Then I create a new accumulation column and restart the same computation (I have to do this 1004 times...).

    id  ini  fin  value  accumulation1  accumulation2  sumOfAccumulation
    1   10   11   5      0              0              5
    2   9    10   0      0              0              0
    3   11   12   2      8              0              10
    4   12   13   1      2              8              11
    5   5    11   3      0              0              3
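
For a single cell, the pandas equivalent of that SUMIFS boils down to a boolean filter plus a sum. Here is a minimal sketch of the id 3 / accumulation1 example, assuming the sample table above is loaded into a DataFrame called df (the name is just for illustration):

    import pandas as pd

    # Sample table from the question
    df = pd.DataFrame({
        'id':    [1, 2, 3, 4, 5],
        'ini':   [10, 9, 11, 12, 5],
        'fin':   [11, 10, 12, 13, 11],
        'value': [5, 0, 2, 1, 3],
    })

    # accumulation1 for id 3: sum the values of rows whose fin equals that row's ini
    # (ini = 11 matches the fin of ids 1 and 5, so 5 + 3 = 8)
    ini_of_id3 = df.loc[df['id'] == 3, 'ini'].iloc[0]
    print(df.loc[df['fin'] == ini_of_id3, 'value'].sum())  # 8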

For now, my code for one accumulation looks like this:

    import time
    import pandas.io.sql as pdsql
    import psycopg2
    import psycopg2.extras

    start_time = time.time()

    connection = psycopg2.connect(dbname=DB_NAME, user=DB_USER, password=DB_PWD,
                                  host=DB_HOST, port=DB_PORT)
    cursor = connection.cursor(cursor_factory=psycopg2.extras.DictCursor)

    data = pdsql.read_sql_query(
        "select id_bdcarth, id_nd_ini::int ini, id_nd_fin::int fin, v from tempturbi.tmp_somme_v19",
        connection)

    value_col = 'v'
    n = 0
    Endtest = 1
    while Endtest != 0:  # loop until the newest accumulation column sums to 0
        n += 1
        new_col = 'accumulation{}'.format(n)
        acc_col = []
        for i in data.ini:
            acc = data[value_col].loc[data.fin == i]  # values of the upstream segments
            acc_col.append(acc.sum())                 # save acc in the accumulation column
        data[new_col] = acc_col
        Endtest = data[new_col].sum()
        value_col = new_col                           # next pass accumulates the new column

    print("--- %s seconds ---" % (time.time() - start_time))

Even without saving the calculated results, the script takes 129 seconds to run for a single accumulation, which is much slower than Excel. Is there any way to improve the script and make it faster?
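
To confirm where the time goes before optimising, a rough timing sketch of a single pass (assuming data is the DataFrame loaded by the query above) could be:

    import time

    t0 = time.perf_counter()
    acc = [data.v.loc[data.fin == i].sum() for i in data.ini]
    t1 = time.perf_counter()
    print("one pass: %.2f s for %d rows" % (t1 - t0, len(data)))

Each row triggers a boolean comparison over the whole fin column, so one pass is roughly O(n²) in the number of rows, and repeating it for every accumulation column multiplies that cost.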

What I'm trying to do is walk along a river network and compute accumulated values downstream: [river network diagram]
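
To make the "walk the network" picture concrete, each row can be read as a directed edge from node ini to node fin, and a row's upstream segments are the rows whose fin equals its ini. A small, purely illustrative sketch on the sample df from above:

    # Upstream lookup: node -> ids of the segments that end at that node
    upstream = df.groupby('fin')['id'].apply(list).to_dict()
    print(upstream)          # {10: [2], 11: [1, 5], 12: [3], 13: [4]}

    # Upstream segments of id 3 (ini = 11) are ids 1 and 5
    print(upstream.get(11))  # [1, 5]

Each accumulation pass pushes the values one hop further downstream along these edges.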


Solution

  • Thanks again for clarifying your question. I think I understand now, and this approach matches the output you've shown. Please let me know if I misunderstood, and also whether this is faster than your approach; I'm not sure that it will be.

    import pandas as pd
    
    #Create the test data
    df = pd.DataFrame({
        'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
        'ini': {0: 10, 1: 9, 2: 11, 3: 12, 4: 5},
        'fin': {0: 11, 1: 10, 2: 12, 3: 13, 4: 11},
        'value': {0: 5, 1: 0, 2: 2, 3: 1, 4: 3},
    })
    
    #Setup initial values
    curr_value_col = 'value'
    i = 0
    all_value_cols = []
    
    #The groupings stay the same throughout the loops
    #so we can just group once and reuse it for speed benefit
    gb = df.groupby('fin')
    
    #Loop forever until we break
    while True:
        #update the loop number and add to the value col list
        i += 1
        all_value_cols.append(curr_value_col)
        
        #group by fin and sum the value_col values
        fin_cumsum = gb[curr_value_col].sum()
        
        #map the sums to the new column
        next_val_col = 'accumulation{}'.format(i)
        df[next_val_col] = df['ini'].map(fin_cumsum).fillna(0).astype(int)
        
        #If the new column we added sums to 0, then quit
        #(I think this is what you were saying you wanted, but I'm not sure)
        curr_value_col = next_val_col
        if df[curr_value_col].sum() == 0:
            break
            
        
    #Sum across all the value/accumulation columns we've been tracking (row-wise)
    df['sumOfAccumulation'] = df[all_value_cols].sum(axis=1)
    df
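
On the sample data this loop stops after the third pass (accumulation3 sums to 0), and sumOfAccumulation comes out to 5, 0, 10, 11, 3, matching the table in the question. The speed-up over the per-row approach comes from grouping on fin once and then doing a single vectorized sum and map per pass, instead of re-scanning the fin column for every row.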