Tags: python, pandas, numpy

Applying a function to each row of a pandas dataframe, not working quite right


I am rusty with Pandas, please be gentle!

I have a dataframe of shape (349, 17) holding various water sample values (pH, salinity, temperature, etc.). I am using the PyCO2SYS toolbox to calculate chemical outputs. I've created a function that should take a dataframe row index, pull the values for that row from the relevant columns, and return the output I want (calculated with PyCO2SYS).

Here's the function:

def pCO2_column(i):
    # input variables
    PAR1 = df['TA (umol/kg)'][i]        # ALK
    PAR2 = df['pH'][i]                  # pH
    SAL = df['Sal psu'][i]              # Salinity
    TEMPIN = df['Temp C'][i]            # Temperature (input)
    TEMPOUT = TEMPIN                    # Temperature (output)
    PRESIN = df['Pressure psi a'][i]    # Pressure (input)
    PRESOUT = PRESIN                    # Pressure (output)
    # Result I want to add into column
    pCO2_out = pyco2.sys(PAR1, PAR2, PAR1TYPE, PAR2TYPE, SAL, TEMPIN, TEMPOUT, PRESIN, PRESOUT, pHSCALEIN, K1K2CONSTANTS, KSO4CONSTANTS)["pCO2_out"]
    return pCO2_out

Note: the other parameters were globally defined; the ones in the function are the ones that will change with each row

I want to use this function for every row index, to create a column of those values I want. I have been able to do it in a clunky way but I want to optimize it. One way I did it was to apply my function to each row based on that index:

df['pCO2_out (μatm)'] = df.apply(lambda row: pCO2_column(df.index), axis=0)

HOWEVER, when I first run it, it gives me the following error:
ValueError: Wrong number of items passed 17, placement implies 1
If I change it to axis=1, each row contains EVERY value calculated for all the rows, in an array. (https://i.sstatic.net/5QaMH.png)
If I change it back to axis=0, it populates correctly, with a single unique value in each row. (https://i.sstatic.net/qkaAM.png)
I know I could also loop through each row, fill an array with the values, then insert that array as a new column...

This seems incredibly simple but I don't know where I've gone wrong. Any advice?


Solution

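  • You've structured your lambda incorrectly: it ignores its row argument and passes the whole df.index to pCO2_column, so every call computes the values for all rows at once. Check the DataFrame.apply documentation or look at some examples online.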

    Specifically, you don't need to look rows up by index: with axis=1, apply already passes each row to your function. Here is a minimal example of the fix:

    import numpy as np
    import pandas as pd

    df_p = pd.DataFrame({'pH': np.random.random(10)})

    def pCO2_column(row):
        # row is one row of df_p, passed in as a Series
        PAR2 = row['pH']
        return PAR2

    df_p.apply(pCO2_column, axis=1)
    
    

    Notice that I don't need the row index: with axis=1, each row arrives in the function as a Series, so I can select its values by column name.
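
Applied to the function from the question, the same pattern might look like the sketch below. PyCO2SYS is not called here: fake_pco2 is a hypothetical stand-in for pyco2.sys(...)["pCO2_out"] with made-up arithmetic, and the sample values are invented, so only the row-handling pattern should be taken from it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pyco2.sys(...)["pCO2_out"]. This is NOT the
# PyCO2SYS API and the arithmetic is meaningless; it only makes the sketch runnable.
def fake_pco2(alk, ph, sal, temp, pres):
    return alk * 10.0 ** (-ph) + sal + temp + pres

# Invented sample data using the column names from the question.
df = pd.DataFrame({
    "TA (umol/kg)": [2300.0, 2250.0, 2400.0],
    "pH": [8.1, 8.0, 7.9],
    "Sal psu": [35.0, 34.5, 33.0],
    "Temp C": [20.0, 18.5, 22.0],
    "Pressure psi a": [14.7, 14.7, 14.7],
})

def pCO2_column(row):
    # row is a Series holding one row of df; select values by column name
    return fake_pco2(
        row["TA (umol/kg)"],
        row["pH"],
        row["Sal psu"],
        row["Temp C"],
        row["Pressure psi a"],
    )

# Pass the function itself; apply hands it one row at a time when axis=1.
df["pCO2_out (μatm)"] = df.apply(pCO2_column, axis=1)

# If the calculation accepts whole columns, apply can be skipped entirely.
vectorized = fake_pco2(
    df["TA (umol/kg)"], df["pH"], df["Sal psu"], df["Temp C"], df["Pressure psi a"]
)
```

The last statement is there because the arrays you saw in every cell suggest that pyco2.sys accepts array inputs. If that holds (an assumption worth checking against the PyCO2SYS docs), passing the columns directly gives one value per row and is usually faster than apply.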