Search code examples
pythonpandaslstmzero-padding

Pandas - Drop NaN's per column and pad with 0 fast?


I have a data frame such as the following with tens of thousands of rows and a thousand columns: enter image description here

For an LSTM, I would like to

  1. extract the values only per column,
  2. put them together at the beginning of the dataframe and
  3. pad 0 before the values start up to index 99.

However, please note, not every column has the same amount of values. Some have already a lot, some have none yet. Also on which timestamp the values are generated is column specific. I did achieve the result with the following code. However, since the code is really slow (700 hours), I am looking for a possibility to execute the calculation logic faster. It takes so long, since I want to calculate this kind of result for each hourly timestamps from 2008 to 2020.

Is there any way to make the code significantly faster?

df1=pd.DataFrame(index=range(100),columns=dummydata.columns)
for j in dummydata.columns:
    df1[j]=dummydata[(dummydata.index<=i)][j].dropna().iloc[-T:].iloc[::-1].reset_index(drop=True)
df1=df1.fillna(0).reset_index(drop=True)

enter image description here


Solution

  • Can you try this to see if this is faster?

    dummydata.apply(lambda x: pd.Series(x.dropna().values)).fillna(0)

    Then you can select only the first 100 rows using dummydata.loc[0:100, :]