I have one pandas Dataframe like below:
name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-09-01 00:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00
I want to find out the gaps available in the year column of the dataframe with respect to particular name. For example,
I want to generate two csv files with contents:
CSV-1:
name,year
AAA,2015-11-02 22:00:00,0
AAA,2015-11-02 23:00:00,0
AAA,2015-11-03 00:00:00,0
AAA,2015-11-03 01:00:00,0
AAA,2015-11-03 02:00:00,2
AAA,2015-11-03 05:00:00,0
ZZZ,2015-09-01 00:00:00,0
ZZZ,2015-11-01 01:00:00,5
ZZZ,2015-11-01 07:00:00,0
ZZZ,2015-11-01 08:00:00,0
ZZZ,2015-11-01 09:00:00,2
ZZZ,2015-11-01 12:00:00,0
CSV-2:
name,prev_year,next_year,gaps
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 03:00:00
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 02:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 03:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 05:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 06:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 10:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 11:00:00
I tried like below:
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
mask = df.groupby("name").year.diff() > pd.Timedelta('0 days 01:00:00')
To get your gap into your dataframe you need to reassign the mask
that you generate. To get this in terms of total hours, you can simply divide by 1 hour:
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
df['Gap'] = (df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).fillna(0)
This gives us the following dataframe:
name year Gap
0 AAA 2015-11-02 22:00:00 0.0
1 AAA 2015-11-02 23:00:00 1.0
2 AAA 2015-11-03 00:00:00 1.0
3 AAA 2015-11-03 01:00:00 1.0
4 AAA 2015-11-03 02:00:00 1.0
5 AAA 2015-11-03 05:00:00 3.0
6 ZZZ 2015-09-01 00:00:00 0.0
7 ZZZ 2015-11-01 07:00:00 6.0
8 ZZZ 2015-11-01 08:00:00 1.0
9 ZZZ 2015-11-01 09:00:00 1.0
10 ZZZ 2015-11-01 12:00:00 3.0
To get the gap next to its start time and consistent with the way you want it for "csv-1", we simply shift this up one row and subtract one before we fill the na values:
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)
This gets:
name year Gap
0 AAA 2015-11-02 22:00:00 0.0
1 AAA 2015-11-02 23:00:00 0.0
2 AAA 2015-11-03 00:00:00 0.0
3 AAA 2015-11-03 01:00:00 0.0
4 AAA 2015-11-03 02:00:00 2.0
5 AAA 2015-11-03 05:00:00 0.0
6 ZZZ 2015-11-01 01:00:00 5.0
7 ZZZ 2015-11-01 07:00:00 0.0
8 ZZZ 2015-11-01 08:00:00 0.0
9 ZZZ 2015-11-01 09:00:00 2.0
10 ZZZ 2015-11-01 12:00:00 0.0
In order to get your second csv, we can do the following:
df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)
df.set_index('year', inplace=True)
df = df.groupby('name', as_index=False)\
.resample(rule='1H')\
.ffill()\
.reset_index()
gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]
gaps.rename({'year': 'gaps'}, index='columns', inplace=True)
First we set up the "before" and "after" columns. Then by changing the index to 'year'
, we can use the .resample()
method to fill in all of our missing hours. by using ffill()
when we resample, we copy down the last available record into all the new rows we add. We know that when 'prev_year' != 'year'
, we're on a row that didn't previously exist in the frame, and thus is one of the gaps, so we filter to just those rows, select the columns we need and rename them. This gives:
name prev_year next_year year
5 AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 03:00:00
6 AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 04:00:00
9 ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 02:00:00
10 ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 03:00:00
11 ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 04:00:00
12 ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 05:00:00
13 ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 06:00:00
17 ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 10:00:00
18 ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 11:00:00
To summarise, your script could look as follows:
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)
df.to_csv('csv-1.csv', index=False)
df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)
df.set_index('year', inplace=True)
df = df.groupby('name', as_index=False)\
.resample(rule='1H')\
.ffill()\
.reset_index()
gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]
gaps.rename({'year': 'gaps'}, index='columns', inplace=True)
gaps.to_csv('csv-2.csv', index=False)