I have a pandas dataframe with a column called "date", which is in the format YYQQ, e.g. 1702.
The column is a time series. However, there are some missing observations. For example, the dataframe below is disrupted at date 1702. I have multiple dataframes like this, each disrupted at a different date. I only want to get the dates after the disruption. In the example below, I would just want the rows on and after 1702. Is there an easy and pythonic way to do it?
date
1301
1302
1303
1304
1401
1702
1703
1704
1801
1802
First, let's recreate your dataframe and create new columns to represent the year (YY) and quarter (QQ) for each respective date:
import pandas as pd
df = pd.DataFrame({'date': [1301,1302,1303,1304,1401,1702,1703,1704,1801,1802]})
df[['YY','QQ']] = df['date'].astype(str).str.extract(r'(\d{2})(\d{2})')
Now we have:
date YY QQ
0 1301 13 01
1 1302 13 02
2 1303 13 03
3 1304 13 04
4 1401 14 01
5 1702 17 02
6 1703 17 03
7 1704 17 04
8 1801 18 01
9 1802 18 02
Let's convert each quarter to the last month of that quarter, convert the result back to a datetime series, and find the first time difference that exceeds 3 months. Finally, we can slice the dataframe from the first index that satisfies this boolean mask (i.e. the disruption):
df['MM'] = df['QQ'].astype(int)*3
df['date_t'] = pd.to_datetime(df['YY'].astype(str)+df['MM'].astype(str), format='%y%m')
months = df['date_t'].dt.year*12 + df['date_t'].dt.month  # integer month count; astype('timedelta64[M]') raises on pandas 2.x
df.loc[df[months.diff() > 3].index[0]:]
Yields:
date YY QQ MM date_t
5 1702 17 02 6 2017-06-01
6 1703 17 03 9 2017-09-01
7 1704 17 04 12 2017-12-01
8 1801 18 01 3 2018-03-01
9 1802 18 02 6 2018-06-01
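Since you have several dataframes, the same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `rows_from_first_gap` is made up for illustration, the column holds YYQQ values that can be cast to integers, and a frame with no gap should be returned unchanged. It also swaps the datetime conversion for a plain integer quarter count (YY*4 + QQ-1), which avoids the helper columns.

```python
import pandas as pd

def rows_from_first_gap(df: pd.DataFrame, col: str = "date") -> pd.DataFrame:
    """Return the rows from the first missing-quarter gap onward.

    Hypothetical helper: assumes `col` holds YYQQ values such as 1702.
    If no gap is found, the frame is returned unchanged.
    """
    values = df[col].astype(int)
    yy = values // 100          # e.g. 1702 -> 17
    qq = values % 100           # e.g. 1702 -> 2
    quarter_index = yy * 4 + (qq - 1)   # consecutive quarters differ by exactly 1

    gap = (quarter_index.diff() > 1).to_numpy()  # first row is NaN -> False
    if not gap.any():
        return df
    first_gap_pos = gap.argmax()   # position of the first True
    return df.iloc[first_gap_pos:]

df = pd.DataFrame({"date": [1301, 1302, 1303, 1304, 1401,
                            1702, 1703, 1704, 1801, 1802]})
print(rows_from_first_gap(df)["date"].tolist())
# [1702, 1703, 1704, 1801, 1802]
```

Slicing by position with `iloc` keeps this working even if the dataframe does not have the default 0..n-1 index. Note that, like the code above, it cuts at the first gap; if a frame can contain several gaps and you want the last one, the position would have to be chosen differently.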