I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date).I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
if dates.Start[index] < dates.End[index]:
dates.Correct[index] = "correct"
elif dates.Start[index] == dates.End[index]:
dates.Correct[index] = "same"
elif dates.Start[index] > dates.End[index]:
dates.Correct[index] = "incorrect"
Which works, it is just taking a really really long-time (about over 15 minutes). I need a more efficiently running code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect