Search code examples
pythonpandasdaskdask-distributeddask-delayed

Efficient way to compute difference of all rows in dask dataframe


I am looking for efficient way to compute difference of all rows in dask dataframe(preferred) Or any efficient way in pandas df.I have huge file with millions of rows, it is taking like forever to compute this.. Below is example:

d = {'A': [5, 9, 15, 0, 2], 'B': [7, 6, 3, 1, 4], 'C': [8, 11, 2, 5, 3]}
df = pd.DataFrame(data=d, index=['i1', 'i2', 'i3', 'i4', 'i5'])
print(df)
tmp = pd.DataFrame(columns=['A', 'B', 'C'])
row_pointer = 1
for index, row in df.iterrows():
    for i, r in df.iloc[row_pointer:].iterrows():
        tmp.loc['(' + index + ' - ' + i + ')'] = df.loc[index] - df.loc[i]
    row_pointer += 1
print(tmp)

Output:

 A  B   C
i1   5  7   8
i2   9  6  11
i3  15  3   2
i4   0  1   5
i5   2  4   3
         A   B   C
(i1 - i2)   -4   1  -3
(i1 - i3)  -10   4   6
(i1 - i4)    5   6   3
(i1 - i5)    3   3   5
(i2 - i3)   -6   3   9
(i2 - i4)    9   5   6
(i2 - i5)    7   2   8
(i3 - i4)   15   2  -3
(i3 - i5)   13  -1  -1
(i4 - i5)   -2  -3   2

Solution

  • Use broadcasting to do all of the subtractions. Then subset after with < to get all unique combinations.

    import pandas as pd
    import numpy as np
    
    arr = df.to_numpy()
    res = pd.DataFrame(np.vstack(arr[:, None]-arr), 
                       index=pd.MultiIndex.from_product([df.index, df.index]))
    res = res[res.index.get_level_values(0) < res.index.get_level_values(1)]
    

    print(res)
    
            0  1  2
    i1 i2  -4  1 -3
       i3 -10  4  6
       i4   5  6  3
       i5   3  3  5
    i2 i3  -6  3  9
       i4   9  5  6
       i5   7  2  8
    i3 i4  15  2 -3
       i5  13 -1 -1
    i4 i5  -2 -3  2
    

    Likely not feasible with millions of rows. Alternatively remove one level of the loop:

    from itertools import chain 
    
    arr = df.to_numpy()
    data = [(arr[i, None]-arr)[i+1:] for i in range(np.shape(arr)[0])]
    
    idx = pd.MultiIndex.from_product([df.index, df.index])
    idx = idx[idx.get_level_values(0) < idx.get_level_values(1)]
    
    res = pd.DataFrame(chain.from_iterable(data), index=idx)