I have a dataframe that gets read in from csv and has extraneous data. Judgment on what is extraneous is made by evaluating one column, SystemStart. Any data per row that is in a column with a heading of date value lower than SystemStart for that row, is set to nan. For example, index = 'one' has a SystemStart date of '2016-1-5', and when the pd.date_range is set up, it has no nan values to populate. index= 'three' is '2016-1-7' and hence has two nan values replacing the original data.
I can go row-by-row and throw np.nan values at all columns, but that is slow. Is there a faster way?
I've created a representative dataframe below, and am looking to get the same result without iterative operations, or a way to speed up those operations. Any help would be greatly appreciated.
import pandas as pd
import numpy as np
start_date = '2016-1-05'
end_date = '2016-1-7'
dates = pd.date_range(start_date, end_date, freq='D')
dt_dates = pd.to_datetime(dates, unit='D')
ind = ['one', 'two', 'three']
df = pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns = dt_dates, index = ind)
df['SystemStart'] = pd.to_datetime(['2016-1-5', '2016-1-6', '2016-1-7'])
print 'Initial Dataframe: \n', df
for msn in df.index:
zero_date_range = pd.date_range(start_date, df.loc[msn,'SystemStart'] - pd.Timedelta(days=1), freq='D')
# we set zeroes for all columns in the index element in question - this is a horribly slow way to do this
df.loc[msn, zero_date_range] = np.NaN
print '\nAltered Dataframe: \n', df
Below are the df outputs, Initial and Altered:
Initial Dataframe:
2016-01-05 00:00:00 2016-01-06 00:00:00 2016-01-07 00:00:00 \
one 24 23 65
two 21 91 59
three 62 77 2
SystemStart
one 2016-01-05
two 2016-01-06
three 2016-01-07
Altered Dataframe:
2016-01-05 00:00:00 2016-01-06 00:00:00 2016-01-07 00:00:00 \
one 24.0 23.0 65
two NaN 91.0 59
three NaN NaN 2
SystemStart
one 2016-01-05
two 2016-01-06
three 2016-01-07
First thing I do is make sure SystemStart
is datetime
df.SystemStart = pd.to_datetime(df.SystemStart)
Then I strip out SystemStart
to a separate series
st = df.SystemStart
Then I drop SytstemStart
from my df
d1 = df.drop('SystemStart', 1)
Then I convert the columns I have left to datetime
d1.columns = pd.to_datetime(d1.columns)
Finally I use numpy
broadcasting to mask the appropriate cells and join SystemStart
back in.
d1.where(d1.columns.values >= st.values[:, None]).join(st)