Search code examples
pythondatefor-loopdata-cleaning

Compare each pair of dates in two columns in python efficiently


I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date).I have over 14,000 observations to run through.

I have data in the form of:

    Start       End
0   2008-10-01  2008-10-31  
1   2006-07-01  2006-12-31  
2   2000-05-01  2002-12-31  
3   1971-08-01  1973-12-31  
4   1969-01-01  1969-12-31  

I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:

dates['Correct'] = " "

And have began to check each date pair using the following, where my dataframe is called dates:

for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"

Which works, it is just taking a really really long-time (about over 15 minutes). I need a more efficiently running code - is there something I am doing wrong or could improve?


Solution

  • Why not just do it in a vectorized way:

    is_correct = dates['Start'] < dates['End']
    is_incorrect = dates['Start'] > dates['End']
    is_same = ~is_correct & ~is_incorrect