Tags: python, pandas, pyspark, multiple-columns, subtraction

Subtract consecutive columns in a Pandas or Pyspark Dataframe


I would like to perform the following operation on a pandas or PySpark dataframe, but I still haven't found a solution.

I want to subtract the values from consecutive columns in a dataframe.

The operation I am describing can be seen in the image below.

Input and Output Dataframe

Bear in mind that the first column of the output dataframe won't have any values, since the first column of the input has no preceding column to subtract.


Solution

  • diff has an axis parameter, so you can do this in one step by passing axis=1:

    In [63]:
    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.rand(3, 4), ['row1', 'row2', 'row3'], ['A', 'B', 'C', 'D'])
    df
    
    Out[63]:
                 A         B         C         D
    row1  0.146855  0.250781  0.766990  0.756016
    row2  0.528201  0.446637  0.576045  0.576907
    row3  0.308577  0.592271  0.553752  0.512420
    
    In [64]:
    df.diff(axis=1)
    
    Out[64]:
           A         B         C         D
    row1 NaN  0.103926  0.516209 -0.010975
    row2 NaN -0.081564  0.129408  0.000862
    row3 NaN  0.283694 -0.038520 -0.041331
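
Because the session above uses random numbers, the result is hard to check by eye. The following is a minimal sketch with a small hand-made frame (the values and the names result and manual are made up for illustration). It shows that diff(axis=1) gives each column minus the column to its left, and that this should match subtracting a copy of the frame shifted one column to the right.

```python
import pandas as pd

# Hypothetical values chosen so the differences are easy to verify by hand.
df = pd.DataFrame(
    {"A": [1.0, 10.0], "B": [4.0, 7.0], "C": [9.0, 7.5]},
    index=["row1", "row2"],
)

# Column-wise difference: each column minus the column to its left.
# The first column has no left neighbour, so it becomes NaN.
result = df.diff(axis=1)

# The same operation written out explicitly: shift every column one
# position to the right, then subtract the shifted frame from the original.
manual = df - df.shift(1, axis=1)

print(result)
```

Here column B of the result holds B - A and column C holds C - B, while column A is all NaN, as in the output requested in the question.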