Search code examples
pythonpandasfinance

Detect errant data over consecutive rows of a Python Pandas DataFrame


While operating on a Pandas DataFrame of quarterly earnings dates, and realizing I would want to do quarter-to-quarter (i.e. Q2 to Q3) comparisons, I realized I ought to be certain my data is ordered correctly and also complete.

So, 1. always the 'correct' quarter (q1->q2->q3->q4->q1->q2...) 2. and isn't missing any quarters (q1->q2->q4->q1->q2->q3) would not be good as missing a q3.

I already have a dataframe I load in, and figured it would make sense to check it in pandas rather than go back and check it preloading.

I came up with what I think are two slightly band-aid solutions to a problem but thought it might be interesting to post the problem and my solutions to see if anyone had some light to shed or found this interesting. It seems difficult to find Pandas information at times to take me from elementary to intermediate level proficiency.

I guess there is either a more elegant solution for this problem, and probably also a more widely applicable concept I could learn and apply here and also to other problems. With no further ado. Or is it ado?...

I have a set of financial earnings data. It looks something like this:

Index   Symbol  Time    Earning_Date    Year    Quarter Last_Quarter
0   AAPL    16:30:00    10/27/2015  2015    Q4  Q3
1   AAPL    16:30:00    7/21/2015   2015    Q3  Q2
2   AAPL    16:30:00    4/27/2015   2015    Q2  Q1
3   AAPL    16:30:00    1/27/2015   2015    Q1  Q4
4   AAPL    16:30:00    10/20/2014  2014    Q4  Q3
5   AAPL    16:30:00    7/22/2014   2014    Q3  Q2
6   AAPL    16:30:00    4/23/2014   2014    Q2  Q1
7   AAPL    16:30:00    1/27/2014   2014    Q1  Q4
8   AAPL    16:30:00    10/28/2013  2013    Q4  Q3
9   AAPL    16:30:00    7/23/2013   2013    Q3  Q2
10  AAPL    16:30:00    4/23/2013   2013    Q2  Q1
11  AAPL    16:30:00    1/23/2013   2013    Q1  Q4
12  AAPL    16:30:00    10/25/2012  2012    Q4  Q3
13  AAPL    16:30:00    7/24/2012   2012    Q3  Q2
14  AAPL    16:30:00    4/24/2012   2012    Q2  Q1
15  AAPL    16:30:00    1/24/2012   2012    Q1  Q4
16  AAPL    16:30:00    10/18/2011  2011    Q4  Q3
17  AAPL    16:30:00    7/19/2011   2011    Q3  Q2
18  AAPL    16:30:00    4/20/2011   2011    Q2  Q1
19  AAPL    16:30:00    1/18/2011   2011    Q1  NaN

First of all, full disclosure - I've already populated this DF with a 'solution' to the problem of appending Last_Quarter to each row - i simply used .shift(-1) to populate that. I'm sure that could've been done better- that data is only as essential to the DF as it helped me solve the problem in my two ways. But it's fine if we solve the problem without the Last_Quarter column. Hope that makes sense.

The bigger question is sort of cleaning for potentially missing or erroneous data. If a stock skipped an earnings quarter, or my data was otherwise corrupt, the sequence might read Index 2 Quarter Q2, Index 3 Quarter Q4, thus Q3 is skipped, and lots of assumptions might then be wrong. So I wanted to just make sure for all data Q1 follows Q4, Q2 follows Q1, Q3 follows Q2, Q4 follows Q3.

And just kick out an exception at the very least if the data is bad. Here are two solutions we came up with:

    accptbl_qtr_pr_tpls = [('Q3','Q4'),('Q4','Q1'),('Q1','Q2'),('Q2','Q3')]
    rows_that_pass = 0
    rows_total = len(self.df)
    print 'total rows', rows_total
    for accptbl_qtr_pr_tpl in accptbl_qtr_pr_tpls:
        foo = self.df.ix[(self.df['Last_Quarter'] == accptbl_qtr_pr_tpl[0]) & (self.df['Quarter'] == accptbl_qtr_pr_tpl[1])]
        rows_that_pass += len(foo)
    if rows_total != 1+rows_that_pass: # the + 1 is to account for NaN in earliest result last_quarter column
        print 'quarter issue!, exiting'

and we also came up with:

        if not (((self.df['Last_Quarter'] == 'Q1') & (self.df['Quarter'] == 'Q2')).any() and ((self.df['Last_Quarter'] == 'Q2') & (self.df['Quarter'] == 'Q3')).any() \
           and ((self.df['Last_Quarter'] == 'Q3') & (self.df['Quarter'] == 'Q4')).any() and ((self.df['Last_Quarter'] == 'Q4') & (self.df['Quarter'] == 'Q1')).any()):
            print "bad data"
        else:
            print 'good data'

Figured i would throw this up here and find out how clever we are or how ... much time we wasted solving a solved problem


Solution

  • I would write a function to return True or False based on a valid combination of values in Quarter and Last_Quarter, then create a new column with the valid status result by applying the function row-wise.

    This will give you the ability to take a slice of the DataFrame with only the good or bad rows.

    The function would look something like this:

    def check_quarters(row):
       # if either Quarter or Last_Quarter is NaN, return False
       if (row['Quarter'] != row['Quarter']) or (row['Last_Quarter'] != row['Last_Quarter']):
          return False
       # check for valid combination when Quarter is Q2 Q3 or Q4
       if int(row['Quarter'][1:2]) - 1 == int(row['Last_Quarter'][1:2]):
          return True
       # check for valid combination when Quarter is Q1
       elif int(row['Quarter'][1:2]) == 1 and int(row['Last_Quarter'][1:2]) == 4:
          return True
       else:
          return False
    

    Apply the function to create the new column:

    df['Valid_Quarters'] = df.apply(check_quarters, axis = 1)
    

    Now you can slice the DataFrame to get only the valid rows:

    df.loc[df['Valid_Quarters'],:]