Search code examples
pythonregexpandasdatetimestrptime

Convert String to Datetime objects with different formats python


So, I'm working with a data frame with dates stored as string in different formats (why do people do this??) and I need to turn them into datetime objects. thing is that they not only have different formats but also are in different languages. I think the only way to approach this is with a Regex that matches each specific format and then pass it through as an argument on pd.to_datetime function, but I simply couldn't make the regex match any string. I think Regex doesn't work with datetime objects such as %d, %b, %Y, etc. (I tried with an instance at a time and it didn't recognize them).

If I run pd.to_datetime method, it shows an error because of those instances in Spanish or other languages (number 7 and 9). So I thought of creating a regex but I just cant get it to work. How can I approach to solve this?

I'm putting a really simplified example of the DF that includes the different kind of date formats since there are more than 800k instances.

    a={'date/time': {0: 'Jan 1, 2020 5:27:21 PM PST',
  1: 'Apr. 26, 2020 12:25:56 p.m. PDT',
  2: 'May 6, 2020 6:25:16 a.m. PDT',
  3: '11/01/2019 18:33:39 PST',
  4: 'May 15, 2020 2:25:12 a.m. PDT',
  5: '30/01/2019 18:35:23 PST',
  6: '10 feb. 2020 21:23:39 GMT-8',
  7: '28 abr. 2020 21:04:03 GMT-7',
  8: 'Jan 2, 2020 2:01:54 AM PST',
  9: '25 ago. 2020 16:07:35 GMT-7'},
 'settlement id': {0: 12493053321,
  1: 12493053321,
  2: 12493053321,
  3: 12493053321,
  4: 12493053321,
  5: 12493053321,
  6: 12493053321,
  7: 12493053321,
  8: 12493053321,
  9: 12493053321}}

import pandas as pd

b=pd.DataFrame(a)
b["fecha2"]=pd.to_datetime(b["date/time"])

thanks a lot!!!


Solution

  • If it's not too much work, you can make a string replacement dictionary. For example:

    MONTH_DICT = {
        'abr': 'apr', 
        'ago': 'aug',
    }
    
    date = df["date/time"].replace(MONTH_DICT, regex=True)
    date = pd.to_datetime(date, errors="raise")
    df["date"] = date
    

    output:

                             date/time  settlement id                       date
    0       Jan 1, 2020 5:27:21 PM PST    12493053321  2020-01-01 17:27:21-08:00
    1  Apr. 26, 2020 12:25:56 p.m. PDT    12493053321  2020-04-26 12:25:56-07:00
    2     May 6, 2020 6:25:16 a.m. PDT    12493053321  2020-05-06 06:25:16-07:00
    3          11/01/2019 18:33:39 PST    12493053321  2019-11-01 18:33:39-07:00
    4    May 15, 2020 2:25:12 a.m. PDT    12493053321  2020-05-15 02:25:12-07:00
    5          30/01/2019 18:35:23 PST    12493053321  2019-01-30 18:35:23-08:00
    6      10 feb. 2020 21:23:39 GMT-8    12493053321  2020-02-10 21:23:39+08:00
    7      28 abr. 2020 21:04:03 GMT-7    12493053321  2020-04-28 21:04:03+07:00
    8       Jan 2, 2020 2:01:54 AM PST    12493053321  2020-01-02 02:01:54-08:00
    9      25 ago. 2020 16:07:35 GMT-7    12493053321  2020-08-25 16:07:35+07:00
    

    This will raise an error if it is unable to parse a date (rather than silently fail to convert). Note that this simple version will replace any appearance of substrings in the MONTH_DICT keys that are found.