While operating on a Pandas DataFrame of quarterly earnings dates, and realizing I would want to do quarter-to-quarter (i.e. Q2 to Q3) comparisons, I realized I ought to be certain my data is ordered correctly and also complete.
So, 1. always the 'correct' quarter (q1->q2->q3->q4->q1->q2...) 2. and isn't missing any quarters (q1->q2->q4->q1->q2->q3) would not be good as missing a q3.
I already have a dataframe I load in, and figured it would make sense to check it in pandas rather than go back and check it preloading.
I came up with what I think are two slightly band-aid solutions to a problem but thought it might be interesting to post the problem and my solutions to see if anyone had some light to shed or found this interesting. It seems difficult to find Pandas information at times to take me from elementary to intermediate level proficiency.
I guess there is either a more elegant solution for this problem, and probably also a more widely applicable concept I could learn and apply here and also to other problems. With no further ado. Or is it ado?...
I have a set of financial earnings data. It looks something like this:
Index Symbol Time Earning_Date Year Quarter Last_Quarter
0 AAPL 16:30:00 10/27/2015 2015 Q4 Q3
1 AAPL 16:30:00 7/21/2015 2015 Q3 Q2
2 AAPL 16:30:00 4/27/2015 2015 Q2 Q1
3 AAPL 16:30:00 1/27/2015 2015 Q1 Q4
4 AAPL 16:30:00 10/20/2014 2014 Q4 Q3
5 AAPL 16:30:00 7/22/2014 2014 Q3 Q2
6 AAPL 16:30:00 4/23/2014 2014 Q2 Q1
7 AAPL 16:30:00 1/27/2014 2014 Q1 Q4
8 AAPL 16:30:00 10/28/2013 2013 Q4 Q3
9 AAPL 16:30:00 7/23/2013 2013 Q3 Q2
10 AAPL 16:30:00 4/23/2013 2013 Q2 Q1
11 AAPL 16:30:00 1/23/2013 2013 Q1 Q4
12 AAPL 16:30:00 10/25/2012 2012 Q4 Q3
13 AAPL 16:30:00 7/24/2012 2012 Q3 Q2
14 AAPL 16:30:00 4/24/2012 2012 Q2 Q1
15 AAPL 16:30:00 1/24/2012 2012 Q1 Q4
16 AAPL 16:30:00 10/18/2011 2011 Q4 Q3
17 AAPL 16:30:00 7/19/2011 2011 Q3 Q2
18 AAPL 16:30:00 4/20/2011 2011 Q2 Q1
19 AAPL 16:30:00 1/18/2011 2011 Q1 NaN
First of all, full disclosure - I've already populated this DF with a 'solution' to the problem of appending Last_Quarter to each row - i simply used .shift(-1) to populate that. I'm sure that could've been done better- that data is only as essential to the DF as it helped me solve the problem in my two ways. But it's fine if we solve the problem without the Last_Quarter column. Hope that makes sense.
The bigger question is sort of cleaning for potentially missing or erroneous data. If a stock skipped an earnings quarter, or my data was otherwise corrupt, the sequence might read Index 2 Quarter Q2, Index 3 Quarter Q4, thus Q3 is skipped, and lots of assumptions might then be wrong. So I wanted to just make sure for all data Q1 follows Q4, Q2 follows Q1, Q3 follows Q2, Q4 follows Q3.
And just kick out an exception at the very least if the data is bad. Here are two solutions we came up with:
accptbl_qtr_pr_tpls = [('Q3','Q4'),('Q4','Q1'),('Q1','Q2'),('Q2','Q3')]
rows_that_pass = 0
rows_total = len(self.df)
print 'total rows', rows_total
for accptbl_qtr_pr_tpl in accptbl_qtr_pr_tpls:
foo = self.df.ix[(self.df['Last_Quarter'] == accptbl_qtr_pr_tpl[0]) & (self.df['Quarter'] == accptbl_qtr_pr_tpl[1])]
rows_that_pass += len(foo)
if rows_total != 1+rows_that_pass: # the + 1 is to account for NaN in earliest result last_quarter column
print 'quarter issue!, exiting'
and we also came up with:
if not (((self.df['Last_Quarter'] == 'Q1') & (self.df['Quarter'] == 'Q2')).any() and ((self.df['Last_Quarter'] == 'Q2') & (self.df['Quarter'] == 'Q3')).any() \
and ((self.df['Last_Quarter'] == 'Q3') & (self.df['Quarter'] == 'Q4')).any() and ((self.df['Last_Quarter'] == 'Q4') & (self.df['Quarter'] == 'Q1')).any()):
print "bad data"
else:
print 'good data'
Figured i would throw this up here and find out how clever we are or how ... much time we wasted solving a solved problem
I would write a function to return True or False based on a valid combination of values in Quarter and Last_Quarter, then create a new column with the valid status result by applying the function row-wise.
This will give you the ability to take a slice of the DataFrame with only the good or bad rows.
The function would look something like this:
def check_quarters(row):
# if either Quarter or Last_Quarter is NaN, return False
if (row['Quarter'] != row['Quarter']) or (row['Last_Quarter'] != row['Last_Quarter']):
return False
# check for valid combination when Quarter is Q2 Q3 or Q4
if int(row['Quarter'][1:2]) - 1 == int(row['Last_Quarter'][1:2]):
return True
# check for valid combination when Quarter is Q1
elif int(row['Quarter'][1:2]) == 1 and int(row['Last_Quarter'][1:2]) == 4:
return True
else:
return False
Apply the function to create the new column:
df['Valid_Quarters'] = df.apply(check_quarters, axis = 1)
Now you can slice the DataFrame to get only the valid rows:
df.loc[df['Valid_Quarters'],:]