So, I'm working with a data frame with dates stored as string in different formats (why do people do this??) and I need to turn them into datetime objects. thing is that they not only have different formats but also are in different languages. I think the only way to approach this is with a Regex
that matches each specific format and then pass it through as an argument on pd.to_datetime
function, but I simply couldn't make the regex match any string. I think Regex
doesn't work with datetime objects such as %d, %b, %Y, etc
. (I tried with an instance at a time and it didn't recognize them).
If I run pd.to_datetime
method, it shows an error because of those instances in Spanish or other languages (number 7 and 9). So I thought of creating a regex but I just cant get it to work. How can I approach to solve this?
I'm putting a really simplified example of the DF that includes the different kind of date formats since there are more than 800k instances.
a={'date/time': {0: 'Jan 1, 2020 5:27:21 PM PST',
1: 'Apr. 26, 2020 12:25:56 p.m. PDT',
2: 'May 6, 2020 6:25:16 a.m. PDT',
3: '11/01/2019 18:33:39 PST',
4: 'May 15, 2020 2:25:12 a.m. PDT',
5: '30/01/2019 18:35:23 PST',
6: '10 feb. 2020 21:23:39 GMT-8',
7: '28 abr. 2020 21:04:03 GMT-7',
8: 'Jan 2, 2020 2:01:54 AM PST',
9: '25 ago. 2020 16:07:35 GMT-7'},
'settlement id': {0: 12493053321,
1: 12493053321,
2: 12493053321,
3: 12493053321,
4: 12493053321,
5: 12493053321,
6: 12493053321,
7: 12493053321,
8: 12493053321,
9: 12493053321}}
import pandas as pd
b=pd.DataFrame(a)
b["fecha2"]=pd.to_datetime(b["date/time"])
thanks a lot!!!
If it's not too much work, you can make a string replacement dictionary. For example:
MONTH_DICT = {
'abr': 'apr',
'ago': 'aug',
}
date = df["date/time"].replace(MONTH_DICT, regex=True)
date = pd.to_datetime(date, errors="raise")
df["date"] = date
output:
date/time settlement id date
0 Jan 1, 2020 5:27:21 PM PST 12493053321 2020-01-01 17:27:21-08:00
1 Apr. 26, 2020 12:25:56 p.m. PDT 12493053321 2020-04-26 12:25:56-07:00
2 May 6, 2020 6:25:16 a.m. PDT 12493053321 2020-05-06 06:25:16-07:00
3 11/01/2019 18:33:39 PST 12493053321 2019-11-01 18:33:39-07:00
4 May 15, 2020 2:25:12 a.m. PDT 12493053321 2020-05-15 02:25:12-07:00
5 30/01/2019 18:35:23 PST 12493053321 2019-01-30 18:35:23-08:00
6 10 feb. 2020 21:23:39 GMT-8 12493053321 2020-02-10 21:23:39+08:00
7 28 abr. 2020 21:04:03 GMT-7 12493053321 2020-04-28 21:04:03+07:00
8 Jan 2, 2020 2:01:54 AM PST 12493053321 2020-01-02 02:01:54-08:00
9 25 ago. 2020 16:07:35 GMT-7 12493053321 2020-08-25 16:07:35+07:00
This will raise an error if it is unable to parse a date (rather than silently fail to convert). Note that this simple version will replace any appearance of substrings in the MONTH_DICT
keys that are found.