In the following data, I need to change the date in the 'DATE' column as the previous date (DATE - 1 Day) where the CLOCKDATETIME hour is less than '4:00' hrs. I've reached a point where I can get the rows with hours less than '4:00' hrs and change the date and combine the result with input, however I'm not getting the desired result, for the input 29 rows of data I'm getting the final result as 41 rows where as number of rows shall remain same. How can I combine the data frames and have the desired result (number of rows shall remain the same as input rows)?
SAMPLE DATA IN CSV FORMAT:
DATE,CARD,CLOCKDATETIME
2015-05-01,100672,2015-05-01 00:03:00
2015-05-01,350132,2015-05-01 00:03:00
2015-05-01,100327,2015-05-01 00:07:00
2015-05-01,350075,2015-05-01 00:07:00
2015-05-01,300148,2015-05-01 00:07:00
2015-05-01,300344,2015-05-01 00:09:00
2015-05-01,100799,2015-05-01 00:11:00
2015-05-01,100771,2015-05-01 00:12:00
2015-05-01,100650,2015-05-01 00:14:00
2015-05-01,100771,2015-05-01 00:15:00
2015-05-01,100186,2015-05-01 00:16:00
2015-05-01,300279,2015-05-01 00:17:00
2015-05-01,300344,2015-05-01 00:17:00
2015-05-01,300148,2015-05-01 00:22:00
2015-05-01,100650,2015-05-01 00:22:00
2015-05-01,100799,2015-05-01 00:23:00
2015-05-01,100582,2015-05-01 00:26:00
2015-05-01,100887,2015-05-01 00:27:00
2015-05-01,100887,2015-05-01 00:30:00
2015-05-01,100746,2015-05-01 08:31:00
2015-05-01,100684,2015-05-01 08:33:00
2015-05-01,100073,2015-05-01 08:33:00
2015-05-01,100771,2015-05-01 08:47:00
2015-05-01,200011,2015-05-01 08:59:00
2015-05-01,100259,2015-05-01 09:07:00
2015-05-01,100631,2015-05-01 09:07:00
2015-05-01,100746,2015-05-01 09:07:00
2015-05-01,200032,2015-05-01 09:08:00
2015-05-01,100684,2015-05-01 09:09:00
Following is the code I've for now:
import pandas as pd
from pandas.tseries.offsets import Day
bi = pd.read_csv('bi2.csv', parse_dates=[0,2])
bic = bi.sort_values(by=bi.columns[2])
bic.set_index(['CLOCKDATETIME'], inplace=True)
bid = bic.between_time('00:00','04:00')
bid.DATE = bid.DATE - Day()
bie = bid.combine_first(bic)
excess_rows = len(bie) - len(bi)
print excess_rows
try this:
from __future__ import print_function
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['DATE','CLOCKDATETIME'])
df.loc[(df['CLOCKDATETIME'].dt.hour <= 4), 'DATE'] -= pd.Timedelta('1 days')
print(df)
Output:
DATE CARD CLOCKDATETIME
0 2015-04-30 100672 2015-05-01 00:03:00
1 2015-04-30 350132 2015-05-01 00:03:00
2 2015-04-30 100327 2015-05-01 00:07:00
3 2015-04-30 350075 2015-05-01 00:07:00
4 2015-04-30 300148 2015-05-01 00:07:00
5 2015-04-30 300344 2015-05-01 00:09:00
6 2015-04-30 100799 2015-05-01 00:11:00
7 2015-04-30 100771 2015-05-01 00:12:00
8 2015-04-30 100650 2015-05-01 00:14:00
9 2015-04-30 100771 2015-05-01 00:15:00
10 2015-04-30 100186 2015-05-01 00:16:00
11 2015-04-30 300279 2015-05-01 00:17:00
12 2015-04-30 300344 2015-05-01 00:17:00
13 2015-04-30 300148 2015-05-01 00:22:00
14 2015-04-30 100650 2015-05-01 00:22:00
15 2015-04-30 100799 2015-05-01 00:23:00
16 2015-04-30 100582 2015-05-01 00:26:00
17 2015-04-30 100887 2015-05-01 00:27:00
18 2015-04-30 100887 2015-05-01 00:30:00
19 2015-05-01 100746 2015-05-01 08:31:00
20 2015-05-01 100684 2015-05-01 08:33:00
21 2015-05-01 100073 2015-05-01 08:33:00
22 2015-05-01 100771 2015-05-01 08:47:00
23 2015-05-01 200011 2015-05-01 08:59:00
24 2015-05-01 100259 2015-05-01 09:07:00
25 2015-05-01 100631 2015-05-01 09:07:00
26 2015-05-01 100746 2015-05-01 09:07:00
27 2015-05-01 200032 2015-05-01 09:08:00
28 2015-05-01 100684 2015-05-01 09:09:00