Search code examples
pythonpandasdataframeaverageexport-to-csv

Calculating averaged data in and writing to csv from a pandas dataframe


I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations. The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation. Data looks like this

x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6

For each x and y pair I want to iterate over a two series of text values and do three things in the z direction.

  1. Calculate the average of one numeric value for all the values with a third specific text value
  2. Sum another numeric value for all the values with the same text value
  3. Write the a resultant table of 'x, y, average, sum' to a csv.

My code does part three (albeit very slowly) but doesn't calculate 1 or 2 or at least I don't appear to get the average and sum calculations in my output.

What have I done wrong and how can I speed it up?

    for text1 in text_list1:
        for text2 in text_list2:
            # Get the data into smaller dataframe
            df = data.loc[ (data["textfield1"] == text1) & (data["textfield2"] == text2 ) ]
                
            #Get the minimum and maximum x and y 
            minXw = df['x'].min()
            maxXw = df['x'].max()
            minYw = df['y'].min()
            maxYw = df['y'].max()
            
            # dictionary for quicker printing
            dict_out  = {}
            rows_list = []
            
            # Make output filename
            filenameOut = text1+"_"+text2+"_Values.csv"
            # Start looping through x values
            for x in np.arange(minXw, maxXw, x_inc):
                xcount += 1
                # Start looping through y values
                for y in np.arange(minYw, maxYw, y_inc):
                    ycount += 1
                    
                    # calculate average and sum
                    ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                    sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()

                    # Make Dictionary of output values
                    dict_out  = dict([('text1', text1), 
                                      ('text2', text2), 
                                      ('text3', df['text3']),
                                      ('x'    , x-x_inc),
                                      ('y'    , y-y_inc),
                                      ('ave'  , ave_val),
                                      ('sum'   , sum_val)])
                    rows_list_c.append(dict_out)

            # Write csv      
            columns = ['text1','text2','text3','x','y','ave','sum']
            with open(filenameOut, 'w') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=columns)
                writer.writeheader()
                for data in dict_out:
                    writer.writerow(data)

My resultant csv gives me:

text1,text2,text3,x,y,ave,sum

text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0

Solution

  • Not really clear what you're trying to do. But here is a starting point

    If you only need to process rows with a specific text3value, start by filtering out the other rows:

    df = df[df.text3=="my_value"]
    

    If at this point, you do not need text3 anymore, you can also drop it

    df = df.drop(columns="text3")
    

    Then you process several sub dataframes, and write each of them to their own csv file. groupby is the perfect tool for that:

    for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
        filenameOut = text1+"_"+text2+"_Values.csv"
        # Process sub df
        output_df = process(sub_df)
        # Write sub df
        output_df.to_csv(filenameOut)
    

    Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.

    Now let's have a look at the process function (Note that you dont really need to make it a separate function, you could as well dump the function body in the for loop).

    At this point, if I understand correctly, you want to compute the sum and the average of every rows that have the same x and y coordinates. Here again you can use groupby and the agg function to compute the mean and the sum of the group.

    def process(sub_df):
       # drop the text1 and text2 columns since they are in the filename anyway
       out = sub_df.drop(columns=["text1","text2"])
       
       # Compute mean and max
       return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
    

    And that's preety much it.

    Bonus: 2-liner version (but don't do that...)

    for (text1, text2), sub_df in df[df.text3=="my_value"].drop(columns="text3").groupby(["text1", "text2"]):
        sub_df.drop(columns=["text1","text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1+"_"+text2+"_Values.csv")