I got a generator object which basically consists of nested lists. It contains around 20.000 lists, the structure looks like this:
cases = [[0,36,12],[64,28,1],....
Each list in this object represents rows that belong to one process. Now I want to assign a ProcessID to the respective rows of a dataframe. In the moment I achieve this using a for loop:
moc = df.iloc
processID = 0
for process in cases:
for step in process:
moc[process,-1] = processID
processID += 1
Even though this works, iterating through a for loop takes long, so I am interested in a more efficient way to assign the processID.
As I need to iterate over the cases object and since the length of the nested lists differ I do not know how to implement more efficient processes such as df.apply() or np.where().
Any help is appreciated.
Example:
import pandas as pd
import numpy as np
cases = [[1,4,2],[3,5,0],[9,6],[7,8]]
d = {'col1': ["some_information", "some_information","some_information",
"some_information","some_information","some_information",
"some_information","some_information","some_information",
"some_information"],
'processID':np.empty}
df = pd.DataFrame(data=d)
print(df)
col1 processID
0 some_information <built-in function empty>
1 some_information <built-in function empty>
2 some_information <built-in function empty>
3 some_information <built-in function empty>
4 some_information <built-in function empty>
5 some_information <built-in function empty>
6 some_information <built-in function empty>
7 some_information <built-in function empty>
8 some_information <built-in function empty>
9 some_information <built-in function empty>
moc = df.iloc
processID = 1
for case in cases:
for idx in case:
moc[idx,-1] = processID
processID += 1
print(df)
col1 processID
0 some_information 2
1 some_information 1
2 some_information 1
3 some_information 2
4 some_information 1
5 some_information 2
6 some_information 3
7 some_information 4
8 some_information 4
9 some_information 3
IIUC, here's a solution using dict comprehension with Index.repeat
and numpy.hstack
to create pandas.Series
that you can use to update your DataFrame
. The benefit of this is there are no loops.
s = pd.Series({(i+1):x for i, x in enumerate(cases)})
processes = pd.Series(s.index.repeat(s.str.len()), index=np.hstack(s))
Based on your example cases
, this will create a Series
'processes' like:
1 1
4 1
2 1
3 2
5 2
0 2
9 3
6 3
7 4
8 4
Then you can then assign into your DataFrame
:
df['processID'] = processes
setup - creating a DataFrame of len 100,000 and random cases
list:
idx = pd.Series(np.arange(100000)).sample(frac=1).values.tolist()
cases = [idx[i:i + 3] for i in range(0, len(idx), 3)]
df=pd.DataFrame({'col1':np.arange(100000),
'col2':['some_data']*100000})
Timing
%%timeit
s = pd.Series({(i+1):x for i, x in enumerate(cases)}).to_frame()
processes = pd.Series(s.index.repeat(s[0].str.len()), index=np.hstack(s[0]))
df['processID'] = processes
92.2 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)