Search code examples
pythonpandasdatetimedateparser

Python / Pandas - Parsing Date-Time-Formats using dateparser different Locales


This is my first post here, so feedback on how to better convey my questions is also appreciated.

Basically I want to apply a date parser on a folder of .csv-files. Problem is: Files can contain different data-time formats (locales).

I chose dateparser, since it can handle different locales "out of the box". Locales in files are: german, english, french, italian, spanish, swedish, polish and netherlands. A single file only has ONE distinct dateformat. File one is german, file 2 is english and so on.

I almost got this to work using dateparser, but for some reason it breaks german dates, other files are processed without issues. 01.08.2021 (Should be 1st of August, becomes January 8th)

Sample-Dates ( all are DD/MM/YYYY HH/MM/SS in structure)

uk = "31 Jul 2021 23:07:35"
fr = "31 juil. 2021 22:36:38"
it = "31 lug 2021 22:14:56"
es = "31 jul. 2021 22:08:08"
de = "01.08.2021 22:15:54"
se = "2 jan. 2022 07:12:44"
nl = "31 jul. 2021 22:04:02"
pl = "2 lis 2021 08:27:54"

My code:

import os, dateparser, pandas as pd

dateien = []

marketplaces = []

quelle = "CSVs/header_inject/"

counter = 0

for subdir, dirs, files in os.walk(quelle):
    for file in files:
        dateien.append(file)

for input_file in dateien:
    df = pd.read_csv(quelle+input_file, header=0, parse_dates=['Datum'], date_parser=dateparser.parse, index_col=False ,dtype=str, low_memory=False,sep=",")
    counter +=1
    df.to_csv("CSVs/Transactions_format/"+str(counter)+".csv", index=False)

Adding "dayfirst=True" while loading the file did nothing for me.


Solution

  • Out of the box, dateparser assumes MDY order; from the docs:


    OOTB Language Based Date Order Preference

    >>> # parsing ambiguous date
    >>> parse('02-03-2016')  # assumes english language, uses MDY date order
    datetime.datetime(2016, 2, 3, 0, 0)
    >>> parse('le 02-03-2016')  # detects french, uses DMY date order
    datetime.datetime(2016, 3, 2, 0, 0)
    

    If you parse dates in a second step instead of during read_csv, you can supply settings to dateparser, EX:

    import dateparser   
    import pandas as pd
    
    df = pd.DataFrame({'dates': (uk,fr,it,es,de,se,nl,pl)})
    
    df['datesparsed'] = df['dates'].apply(dateparser.parse, settings={'DATE_ORDER': 'DMY'})
    
    # df
    #                     dates         datesparsed
    # 0    31 Jul 2021 23:07:35 2021-07-31 23:07:35
    # 1  31 juil. 2021 22:36:38 2021-07-31 22:36:38
    # 2    31 lug 2021 22:14:56 2021-07-31 22:14:56
    # 3   31 jul. 2021 22:08:08 2021-07-31 22:08:08
    # 4     01.08.2021 22:15:54 2021-08-01 22:15:54
    # 5    2 jan. 2022 07:12:44 2022-01-02 07:12:44
    # 6   31 jul. 2021 22:04:02 2021-07-31 22:04:02
    # 7     2 lis 2021 08:27:54 2021-11-02 08:27:54
    

    ( ! ) Note however that now all ambiguous dates will be interpreted as DMY.