Search code examples
pythonpandasdataframetime-seriessplit-apply-combine

pandas combine_first resulting in more number of rows


In the following data, I need to change the date in the 'DATE' column as the previous date (DATE - 1 Day) where the CLOCKDATETIME hour is less than '4:00' hrs. I've reached a point where I can get the rows with hours less than '4:00' hrs and change the date and combine the result with input, however I'm not getting the desired result, for the input 29 rows of data I'm getting the final result as 41 rows where as number of rows shall remain same. How can I combine the data frames and have the desired result (number of rows shall remain the same as input rows)?

SAMPLE DATA IN CSV FORMAT:

DATE,CARD,CLOCKDATETIME
2015-05-01,100672,2015-05-01 00:03:00
2015-05-01,350132,2015-05-01 00:03:00
2015-05-01,100327,2015-05-01 00:07:00
2015-05-01,350075,2015-05-01 00:07:00
2015-05-01,300148,2015-05-01 00:07:00
2015-05-01,300344,2015-05-01 00:09:00
2015-05-01,100799,2015-05-01 00:11:00
2015-05-01,100771,2015-05-01 00:12:00
2015-05-01,100650,2015-05-01 00:14:00
2015-05-01,100771,2015-05-01 00:15:00
2015-05-01,100186,2015-05-01 00:16:00
2015-05-01,300279,2015-05-01 00:17:00
2015-05-01,300344,2015-05-01 00:17:00
2015-05-01,300148,2015-05-01 00:22:00
2015-05-01,100650,2015-05-01 00:22:00
2015-05-01,100799,2015-05-01 00:23:00
2015-05-01,100582,2015-05-01 00:26:00
2015-05-01,100887,2015-05-01 00:27:00
2015-05-01,100887,2015-05-01 00:30:00
2015-05-01,100746,2015-05-01 08:31:00
2015-05-01,100684,2015-05-01 08:33:00
2015-05-01,100073,2015-05-01 08:33:00
2015-05-01,100771,2015-05-01 08:47:00
2015-05-01,200011,2015-05-01 08:59:00
2015-05-01,100259,2015-05-01 09:07:00
2015-05-01,100631,2015-05-01 09:07:00
2015-05-01,100746,2015-05-01 09:07:00
2015-05-01,200032,2015-05-01 09:08:00
2015-05-01,100684,2015-05-01 09:09:00

Following is the code I've for now:

import pandas as pd
from pandas.tseries.offsets import Day

bi = pd.read_csv('bi2.csv', parse_dates=[0,2])
bic = bi.sort_values(by=bi.columns[2])
bic.set_index(['CLOCKDATETIME'], inplace=True)
bid = bic.between_time('00:00','04:00')
bid.DATE = bid.DATE - Day()
bie = bid.combine_first(bic)

excess_rows = len(bie) - len(bi)

print excess_rows

Solution

  • try this:

    from __future__ import print_function
    
    import pandas as pd
    
    df = pd.read_csv('data.csv', parse_dates=['DATE','CLOCKDATETIME'])
    
    df.loc[(df['CLOCKDATETIME'].dt.hour <= 4), 'DATE'] -= pd.Timedelta('1 days')
    print(df)
    

    Output:

             DATE    CARD       CLOCKDATETIME
    0  2015-04-30  100672 2015-05-01 00:03:00
    1  2015-04-30  350132 2015-05-01 00:03:00
    2  2015-04-30  100327 2015-05-01 00:07:00
    3  2015-04-30  350075 2015-05-01 00:07:00
    4  2015-04-30  300148 2015-05-01 00:07:00
    5  2015-04-30  300344 2015-05-01 00:09:00
    6  2015-04-30  100799 2015-05-01 00:11:00
    7  2015-04-30  100771 2015-05-01 00:12:00
    8  2015-04-30  100650 2015-05-01 00:14:00
    9  2015-04-30  100771 2015-05-01 00:15:00
    10 2015-04-30  100186 2015-05-01 00:16:00
    11 2015-04-30  300279 2015-05-01 00:17:00
    12 2015-04-30  300344 2015-05-01 00:17:00
    13 2015-04-30  300148 2015-05-01 00:22:00
    14 2015-04-30  100650 2015-05-01 00:22:00
    15 2015-04-30  100799 2015-05-01 00:23:00
    16 2015-04-30  100582 2015-05-01 00:26:00
    17 2015-04-30  100887 2015-05-01 00:27:00
    18 2015-04-30  100887 2015-05-01 00:30:00
    19 2015-05-01  100746 2015-05-01 08:31:00
    20 2015-05-01  100684 2015-05-01 08:33:00
    21 2015-05-01  100073 2015-05-01 08:33:00
    22 2015-05-01  100771 2015-05-01 08:47:00
    23 2015-05-01  200011 2015-05-01 08:59:00
    24 2015-05-01  100259 2015-05-01 09:07:00
    25 2015-05-01  100631 2015-05-01 09:07:00
    26 2015-05-01  100746 2015-05-01 09:07:00
    27 2015-05-01  200032 2015-05-01 09:08:00
    28 2015-05-01  100684 2015-05-01 09:09:00