Tags: python-2.7, pandas, lambda, mapping, boolean-operations

How to apply operations on rows of a dataframe, but with variable columns affected?


I have a dataframe that gets read in from CSV and contains extraneous data. Whether a value is extraneous is judged by one column, SystemStart: in each row, any value sitting in a column whose date heading is earlier than that row's SystemStart should be set to NaN. For example, index 'one' has a SystemStart date of '2016-1-5', so once the pd.date_range is set up it has no values to replace with NaN; index 'three' has '2016-1-7' and hence gets two NaN values replacing the original data.

I can go row by row and assign np.nan to the affected columns, but that is slow. Is there a faster way?

I've created a representative dataframe below, and am looking to get the same result without iterative operations, or a way to speed up those operations. Any help would be greatly appreciated.

import pandas as pd
import numpy as np

start_date = '2016-1-05'
end_date = '2016-1-7'
dates = pd.date_range(start_date, end_date, freq='D')
dt_dates = pd.to_datetime(dates, unit='D')
ind = ['one', 'two', 'three']

df = pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns = dt_dates, index = ind)

df['SystemStart'] = pd.to_datetime(['2016-1-5', '2016-1-6', '2016-1-7'])

print 'Initial Dataframe: \n', df

for msn in df.index:
    zero_date_range = pd.date_range(start_date, df.loc[msn, 'SystemStart'] - pd.Timedelta(days=1), freq='D')

    # assign NaN to every date column before this row's SystemStart - this is a horribly slow way to do it
    df.loc[msn, zero_date_range] = np.NaN

print '\nAltered Dataframe: \n', df

Below are the df outputs, Initial and Altered:

Initial Dataframe: 
       2016-01-05 00:00:00  2016-01-06 00:00:00  2016-01-07 00:00:00  \
one                     24                   23                   65   
two                     21                   91                   59   
three                   62                   77                    2   

      SystemStart  
one    2016-01-05  
two    2016-01-06  
three  2016-01-07  

Altered Dataframe: 
       2016-01-05 00:00:00  2016-01-06 00:00:00  2016-01-07 00:00:00  \
one                   24.0                 23.0                   65   
two                    NaN                 91.0                   59   
three                  NaN                  NaN                    2   

      SystemStart  
one    2016-01-05  
two    2016-01-06  
three  2016-01-07  

Solution

  • First thing I do is make sure SystemStart is datetime

    df.SystemStart = pd.to_datetime(df.SystemStart)
    

    Then I strip out SystemStart to a separate series

    st = df.SystemStart
    

    Then I drop SystemStart from my df

    d1 = df.drop('SystemStart', axis=1)
    

    Then I convert the columns I have left to datetime

    d1.columns = pd.to_datetime(d1.columns)
    

    Finally I use numpy broadcasting to mask the appropriate cells and join SystemStart back in.

    d1.where(d1.columns.values >= st.values[:, None]).join(st)
    

    This reproduces the Altered Dataframe shown above.
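Putting the steps together, here is a self-contained sketch of the whole vectorized approach. The random values from the question are replaced with the fixed numbers from the printed Initial Dataframe so the result is reproducible; everything else follows the steps above.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question, with the printed values
# instead of np.random.randint so the output is deterministic.
dates = pd.to_datetime(pd.date_range('2016-1-5', '2016-1-7', freq='D'))
df = pd.DataFrame([[24, 23, 65], [21, 91, 59], [62, 77, 2]],
                  columns=dates, index=['one', 'two', 'three'])
df['SystemStart'] = pd.to_datetime(['2016-1-5', '2016-1-6', '2016-1-7'])

st = df.SystemStart                      # keep SystemStart aside as a Series
d1 = df.drop('SystemStart', axis=1)      # date columns only
d1.columns = pd.to_datetime(d1.columns)  # ensure column labels are datetimes

# (n_rows, 1) against (n_cols,) broadcasts to an (n_rows, n_cols) boolean
# mask: True where the column date is on or after that row's SystemStart.
mask = d1.columns.values >= st.values[:, None]

# where() keeps values where the mask is True and puts NaN elsewhere,
# then SystemStart is joined back in.
result = d1.where(mask).join(st)
```

The key design point is that the mask is computed once for the whole frame via NumPy broadcasting, instead of building a per-row date range inside a Python loop.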