Search code examples
pythonpandaspython-dateutil

speed up parse function dateutil


Using dateutil I created function which check if data from csv is date and if not remove it from dataframe, but I have problem with speed, similar checking for string or int takes 1-2 seconds for hundreds of thousands rows, but this takes more than 1 minute for date checking. I need some tips, how to speed up this, maybe there is some function from pandas library

df = pd.read_csv(filename, delimiter='|', dtype=str)
for i, ColumnValueDate in enumerate(df[column]):
     try:
        df.loc[i, column[0]] = parse(str(ColumnValueDate)).strftime("%Y-%m-%d")
    except ParserError as ex:
        dataframeIndexToDelete.append(i)
        print(ex)
df = df.drop(dataframeIndexToDelete)

Solution

  • You could just use to_datetime on the column, with errors='coerce' and then remove all values which are NaT (not a time):

    df[column] = pd.to_datetime(df[column], errors='coerce')
    df = df[~df[column].isna()]
    

    If you don't want to modify the dataframe, use a temporary Series instead:

    dates = pd.to_datetime(df[column], errors='coerce')
    df = df[~dates.isna()]
    

    Update

    Based on comments, df[column] could have NaN or empty ('') values which should not be excluded. This requires making a second mask to ensure they are in the output even though the date conversion will fail:

    m1 = (df['Date'].isna()) | (df['Date'] == '')
    m2 = ~pd.to_datetime(df['Date'], errors='coerce').isna()
    
    df = df[m1 | m2]