I am looking for an efficient way to compute the difference of all pairs of rows in a dask dataframe (preferred), or any efficient way in a pandas df. I have a huge file with millions of rows, and this is taking forever to compute. Below is an example:
import pandas as pd

d = {'A': [5, 9, 15, 0, 2], 'B': [7, 6, 3, 1, 4], 'C': [8, 11, 2, 5, 3]}
df = pd.DataFrame(data=d, index=['i1', 'i2', 'i3', 'i4', 'i5'])
print(df)

tmp = pd.DataFrame(columns=['A', 'B', 'C'])
row_pointer = 1
# for every row, subtract each of the rows that come after it
for index, row in df.iterrows():
    for i, r in df.iloc[row_pointer:].iterrows():
        tmp.loc['(' + index + ' - ' + i + ')'] = df.loc[index] - df.loc[i]
    row_pointer += 1
print(tmp)
Output:
A B C
i1 5 7 8
i2 9 6 11
i3 15 3 2
i4 0 1 5
i5 2 4 3
A B C
(i1 - i2) -4 1 -3
(i1 - i3) -10 4 6
(i1 - i4) 5 6 3
(i1 - i5) 3 3 5
(i2 - i3) -6 3 9
(i2 - i4) 9 5 6
(i2 - i5) 7 2 8
(i3 - i4) 15 2 -3
(i3 - i5) 13 -1 -1
(i4 - i5) -2 -3 2
Use broadcasting to do all of the subtractions at once. Then subset with < on the two index levels to keep only the unique combinations.
import pandas as pd
import numpy as np

arr = df.to_numpy()
# (n, 1, k) - (n, k) broadcasts to (n, n, k): every row minus every other row
res = pd.DataFrame(np.vstack(arr[:, None] - arr),
                   index=pd.MultiIndex.from_product([df.index, df.index]))
# keep only the pairs where the first label sorts before the second
res = res[res.index.get_level_values(0) < res.index.get_level_values(1)]
print(res)
0 1 2
i1 i2 -4 1 -3
i3 -10 4 6
i4 5 6 3
i5 3 3 5
i2 i3 -6 3 9
i4 9 5 6
i5 7 2 8
i3 i4 15 2 -3
i5 13 -1 -1
i4 i5 -2 -3 2
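The columns come out as 0, 1, 2 because the labels are lost when going through NumPy. If you want the original column names and the same '(i1 - i2)' style index as in the question, a small follow-up like this should work (just a sketch on top of the res above):

# restore the original column labels and flatten the MultiIndex into '(i1 - i2)' strings
res.columns = df.columns
res.index = ['(' + a + ' - ' + b + ')' for a, b in res.index]
print(res)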
This is likely not feasible with millions of rows, since the full (n, n, k) broadcast has to be held in memory at once. Alternatively, remove one level of the loop:
from itertools import chain

arr = df.to_numpy()
# for each row i, only compute the differences against the rows after it
data = [(arr[i, None] - arr)[i + 1:] for i in range(np.shape(arr)[0])]
idx = pd.MultiIndex.from_product([df.index, df.index])
idx = idx[idx.get_level_values(0) < idx.get_level_values(1)]
res = pd.DataFrame(chain.from_iterable(data), index=idx)
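Even this version still keeps the full n*(n-1)/2 result in memory at the end. If that is too much, one option is to stream each row's block of differences straight to disk instead of collecting them. Below is a minimal sketch, assuming CSV output is acceptable; stream_row_differences and the file name pairwise_diffs.csv are just illustrative names, not part of any library:

import numpy as np
import pandas as pd

def stream_row_differences(df, out_path='pairwise_diffs.csv'):
    # write all pairwise row differences to disk one source row at a time,
    # so memory per step is proportional to the remaining rows, never n**2
    arr = df.to_numpy()
    labels = df.index.to_numpy()
    first = True
    for i in range(len(arr) - 1):
        block = arr[i] - arr[i + 1:]  # row i minus every later row
        idx = ['(' + labels[i] + ' - ' + lab + ')' for lab in labels[i + 1:]]
        pd.DataFrame(block, index=idx, columns=df.columns).to_csv(
            out_path, mode='w' if first else 'a', header=first)
        first = False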