Search code examples
pandasfor-loopdataframeprocessing-efficiency

How to assign values to cells of a dataframe efficiently iterating over another object


I got a generator object which basically consists of nested lists. It contains around 20.000 lists, the structure looks like this:

cases = [[0,36,12],[64,28,1],....

Each list in this object represents rows that belong to one process. Now I want to assign a ProcessID to the respective rows of a dataframe. In the moment I achieve this using a for loop:

moc = df.iloc
processID = 0 
for process in cases:
  for step in process:
    moc[process,-1] = processID
  processID += 1

Even though this works, iterating through a for loop takes long, so I am interested in a more efficient way to assign the processID.

As I need to iterate over the cases object and since the length of the nested lists differ I do not know how to implement more efficient processes such as df.apply() or np.where().

Any help is appreciated.

Example:

import pandas as pd
import numpy as np

cases = [[1,4,2],[3,5,0],[9,6],[7,8]]


d = {'col1': ["some_information", "some_information","some_information",
              "some_information","some_information","some_information", 
              "some_information","some_information","some_information",
              "some_information"],
    'processID':np.empty}

df = pd.DataFrame(data=d)

print(df)
               col1                  processID
0  some_information  <built-in function empty>
1  some_information  <built-in function empty>
2  some_information  <built-in function empty>
3  some_information  <built-in function empty>
4  some_information  <built-in function empty>
5  some_information  <built-in function empty>
6  some_information  <built-in function empty>
7  some_information  <built-in function empty>
8  some_information  <built-in function empty>
9  some_information  <built-in function empty>

moc = df.iloc
processID = 1
for case in cases:
    for idx in case:
        moc[idx,-1] = processID

    processID += 1


print(df)
               col1 processID
0  some_information         2
1  some_information         1
2  some_information         1
3  some_information         2
4  some_information         1
5  some_information         2
6  some_information         3
7  some_information         4
8  some_information         4
9  some_information         3

Solution

  • IIUC, here's a solution using dict comprehension with Index.repeat and numpy.hstack to create pandas.Series that you can use to update your DataFrame. The benefit of this is there are no loops.

    s = pd.Series({(i+1):x for i, x in enumerate(cases)})
    processes = pd.Series(s.index.repeat(s.str.len()), index=np.hstack(s))
    

    Based on your example cases, this will create a Series 'processes' like:

    1    1
    4    1
    2    1
    3    2
    5    2
    0    2
    9    3
    6    3
    7    4
    8    4
    

    Then you can then assign into your DataFrame:

    df['processID'] = processes
    

    Testing performance

    setup - creating a DataFrame of len 100,000 and random cases list:

    idx = pd.Series(np.arange(100000)).sample(frac=1).values.tolist()
    cases = [idx[i:i + 3] for i in range(0, len(idx), 3)]
    
    df=pd.DataFrame({'col1':np.arange(100000),
                     'col2':['some_data']*100000})
    

    Timing

    %%timeit
    
    s = pd.Series({(i+1):x for i, x in enumerate(cases)}).to_frame()
    processes = pd.Series(s.index.repeat(s[0].str.len()), index=np.hstack(s[0]))
    df['processID'] = processes
    
    92.2 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)