Search code examples
pandasdataframedatetimetime-series

How to check for typos in a time series dataframe?


I have a user-input time series dataframe whose date column supposedly consists of all consecutive Mondays and Thursdays during an extended period of time stretched over few years. Specifically, "date" column (which is the first column) has either one of these formats: "Monday mm/dd/yyyy" or "Thursday mm/dd/yyyy". (The first few rows of the series are shown as an example even though there is no typo. One can imagine that having a much much longer period of time stretched over few years, it is more likely to encounter typos)

Monday 07/31/2023
Thursday 07/27/2023
Monday 07/24/2023
Thursday 07/20/2023
Monday 07/17/2023
Friday 07/13/2023
Thursday 07/10/2023
Monday 07/06/2023
Thursday 07/03/2023
Monday 06/29/2023

I would like to check this column against the following tests as concise as possible:

  1. day portion of rows are alternating; no single day is repeated among any two neighboring rows.
  2. There is a difference of 3 or 4 days between the date portion of any two neighboring rows.
  3. In each row, day and date match based on the calendar; whether a specific datetime object, mm/dd/yyyy is actually a "Monday" or a "Thursday".

Solution

  • You can try:

    tmp = df['date'].str.split(expand=True)
    
    # are the dates valid?
    check1 = (pd.to_datetime(df['date'], format='%A %m/%d/%Y')
                .dt.strftime('%A %m/%d/%Y')
                .eq(df['date']).all()
              )
    
    # are the days a succession of Monday/Thursday?
    check2 = tmp[0].eq(np.tile(['Monday', 'Thursday'], len(tmp)//2)[:len(tmp)]).all()
    
    # are the time differences a succession of alternating 3/4 days?
    s = pd.to_datetime(tmp[1], format='%m/%d/%Y').diff(-1).iloc[:-1]
    check3 = (    (set(s.iloc[:2].dt.days) == {3, 4})
              and s.iloc[2:].eq(s.shift(2).iloc[2:]).all()
             )
    
    print(check1, check2, check3)
    # False False True
    

    Used input:

    df = pd.DataFrame({'date': ['Monday 07/31/2023',
                                'Thursday 07/27/2023',
                                'Monday 07/24/2023',
                                'Thursday 07/20/2023',
                                'Monday 07/17/2023',
                                'Friday 07/13/2023',
                                'Thursday 07/10/2023',
                                'Monday 07/06/2023',
                                'Thursday 07/03/2023',
                                'Monday 06/29/2023']})
    

    If you want to identify the offending rows:

    tmp = df['date'].str.split(expand=True)
    
    # are the dates valid?
    check1 = (pd.to_datetime(df['date'], format='%A %m/%d/%Y')
                .dt.strftime('%A %m/%d/%Y')
                .eq(df['date'])
              )
    
    def alternating(s, cat):
        s = pd.Series(pd.Categorical(s, categories=cat).codes, index=s.index)
        N = len(cat)-1
        return s.eq((s+N).mod(N+1).shift(N).fillna(s)) & s.between(0, N)
    
    # are the days a succession of Monday/Thursday?
    # this flags False if the current or previous day in not alternating M/T
    check2 = alternating(tmp[0], cat=['Monday', 'Thursday'])
    
    # are the time differences a succession of alternating 3/4 days?
    check3 = alternating(pd.to_datetime(tmp[1], format='%m/%d/%Y').diff(-1).dt.days.iloc[:-1],
                         cat=[3, 4]
                        ).reindex(tmp.index, method='ffill')
    
    
    df['valid_date?'] = check1
    df['Mon/Thu?'] = check2
    df['3/4 days'] = check3
    df['invalid?'] = ~(check1 & check2 & check3)
    
    print(df)
    

    Output:

                      date  valid_date?  Mon/Thu?  3/4 days  invalid?
    0    Monday 07/31/2023         True      True      True     False
    1  Thursday 07/27/2023         True      True      True     False
    2    Monday 07/24/2023         True      True      True     False
    3  Thursday 07/20/2023         True      True      True     False
    4    Monday 07/17/2023         True      True      True     False
    5    Friday 07/13/2023        False     False      True      True
    6  Thursday 07/10/2023        False     False      True      True
    7    Monday 07/06/2023        False      True      True      True
    8  Thursday 07/03/2023        False      True      True      True
    9    Monday 06/29/2023        False      True      True      True