
Easy way to remove certain observations (time series) in a pandas dataframe?


I have a pandas dataframe with a column called "date" in YYQQ format, e.g. 1702.

The column is a time series, but some observations are missing. For example, the dataframe below is disrupted at date 1702. I have multiple dataframes like this, each disrupted at a different date. I only want to keep the dates after the disruption; in the example below, that means the rows on and after 1702. Is there an easy and Pythonic way to do it?

date
1301
1302
1303
1304
1401
1702
1703
1704
1801
1802

Solution

  • First, let's recreate your dataframe and create new columns to represent the year (YY) and quarter (QQ) for each respective date:

    import pandas as pd
    
    df = pd.DataFrame({'date': [1301,1302,1303,1304,1401,1702,1703,1704,1801,1802]})
    
    df[['YY','QQ']] = df['date'].astype(str).str.extract(r'(\d{2})(\d{2})')
    

    Now we have:

       date  YY  QQ
    0  1301  13  01
    1  1302  13  02
    2  1303  13  03
    3  1304  13  04
    4  1401  14  01
    5  1702  17  02
    6  1703  17  03
    7  1704  17  04
    8  1801  18  01
    9  1802  18  02
    

    Next, map each quarter to its last month (QQ * 3) so that the year and month can be parsed into a datetime series. The disruption is the first row whose gap from the previous row exceeds 3 months, so we slice the dataframe from that row onward:

    df['MM'] = df['QQ'].astype(int)*3
    
    df['date_t'] = pd.to_datetime(df['YY'].astype(str)+df['MM'].astype(str), format='%y%m')
    
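    # Whole months between consecutive rows ('M' is not a supported timedelta unit in pandas 2.x)
    months = df['date_t'].dt.year*12 + df['date_t'].dt.month
    # argmax gives the position of the first gap, so iloc works with any index
    df.iloc[(months.diff() > 3).to_numpy().argmax():]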
    

    Yields:

       date  YY  QQ  MM     date_t
    5  1702  17  02   6 2017-06-01
    6  1703  17  03   9 2017-09-01
    7  1704  17  04  12 2017-12-01
    8  1801  18  01   3 2018-03-01
    9  1802  18  02   6 2018-06-01
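
Since there are several dataframes to process, the same idea can be wrapped in a function that skips the helper columns. The sketch below is one possible variant, not the only way to do it: `rows_after_gap` is a hypothetical helper name, and it assumes the column holds integer YYQQ codes sorted in ascending order. It turns each code into a running quarter count, so consecutive quarters always differ by exactly 1.

```python
import pandas as pd


def rows_after_gap(df: pd.DataFrame, col: str = "date") -> pd.DataFrame:
    """Return the rows from the first break in a YYQQ quarterly sequence onward.

    Hypothetical helper, not part of pandas. Assumes `col` holds integer
    codes such as 1702 (year 17, quarter 2), sorted in ascending order.
    """
    codes = df[col].astype(int)
    # Map each YYQQ code to a running quarter count: 1304 -> 13*4 + 4 = 56.
    quarter_index = (codes // 100) * 4 + codes % 100
    # Consecutive quarters differ by exactly 1; anything larger is a gap.
    is_gap = (quarter_index.diff() > 1).to_numpy()
    if not is_gap.any():
        return df  # no disruption found, keep everything
    # argmax is the position of the first True, so the slice is positional.
    return df.iloc[is_gap.argmax():]


df = pd.DataFrame({"date": [1301, 1302, 1303, 1304, 1401, 1702, 1703, 1704, 1801, 1802]})
result = rows_after_gap(df)
print(result["date"].tolist())  # [1702, 1703, 1704, 1801, 1802]
```

If a dataframe has more than one gap, this version cuts at the first one; cutting at the last gap instead would need the position of the last `True` in `is_gap`.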