Search code examples
pythonpandasnumpyloopsapply

How to rewrite a for loop to get a solution faster in pandas?


DataFrame df:

  • no column has a single int value 1-28 in increasing order, until the next group starts (dataframe is presorted). Some values are missing (e.g 1,2,3,5,7,1,2,3,28,1 ...)
  • intensity column has a single float value 0.0-1.0.
  • target has a single numpy array with indexes 0-27, which should contain all intensities of the group. Intensity with no 1 should be at 0th index of this numpy array, intensity with no 2 at 1st and so on. This array is filled with 0.0 at first
index no intensity target
0 2 0.027350 np.arange(0,27)
1 4 0.074639 np.arange(0,27)
2 5 0.056452 np.arange(0,27)
3 2 0.011223 np.arange(0,27)

The result should be:

Let's say we look at row x: in this row numpy array of df['target'] column should be rewritten so that at it's position (df['no']-1)[x] 0.0 value should be rewritten as df['intensity'][x] value,

e.g. if df['no'][0] = 2 and df['intensity'][0] = 0.027350 then df['target'][0] = [0.0, 0.027350, 0.0 ... ] If there's no such no value in this group intensity should stay 0.0

target column from example table above should look like this:

array([[0.0, 0.027350, 0.0, 0.074639, 0.056452, 0.0(at indexes 5-27)],[0.0, 0.011223, ...]])

I got this result using for loop

for i in df.index: 
    df['target'][i][(df['no'][i])-1] = df['intensity'][i]
    if i >= df.index.max():
        break
    if df['no'][i] < df['no'][i+1]:
        df['target'][i+1] = df['target'][i] + df['target'][i+1]
        df.drop(i, inplace=True)

It's too slow though, because I have millions of rows. Is there a way to use .apply() function or anything else to speed up the process?


Solution

  • You do not need an extra intensity column and modify row by row. You could do one operation for each group:

    df['group'] = (df['no'].shift(1) > df['no']).cumsum()
    def map_intensity(arr):
        new_arr = np.zeros(27)
        new_arr[arr['no'].values] = arr['intensity'].values
        return new_arr
    df.groupby('group').apply(map_intensity)