My DataFrame looks like this:
id date
1 21 July 2023 (abcd)
2 22 July 2023 00:00:01
3 23 July 2023 -abcda
I need to remove everything after the year (2023), but keep the year itself. So the result should be:
id date
1 21 July 2023
2 22 July 2023
3 23 July 2023
I tried this, but it drops the year as well:
df['date'].str.rsplit('2023', n=1).str.get(0)
I can't simply append '2023' to the string that is left after this operation, because the year can change. I can deal with that part myself, though. I just need to get the result above.
Regards Tomasz
You can use the following regex with str.replace to remove everything after the 4 digits of the year:
df['date'] = df['date'].str.replace(r'(?<=\b\d{4}\b).*', '', regex=True)
Or with str.extract to capture the day digits, the month letters, and the 4 digits of the year:
df['date'] = df['date'].str.extract(r'(\d+ [a-zA-Z]+ \d{4})')
NB. If you only want to cut after 2023 and not after any 4-digit year, replace \d{4} with 2023.
Output:
id date
0 1 21 July 2023
1 2 22 July 2023
2 3 23 July 2023
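To reproduce the output above end to end, here is a minimal sketch. The DataFrame is rebuilt from the sample rows in the question, and the variable names replaced and extracted are only illustrative. Passing expand=False to str.extract is an addition here, so that a Series is returned instead of a one-column DataFrame.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": [
        "21 July 2023 (abcd)",
        "22 July 2023 00:00:01",
        "23 July 2023 -abcda",
    ],
})

# Approach 1: delete everything that follows a standalone 4-digit number
replaced = df["date"].str.replace(r"(?<=\b\d{4}\b).*", "", regex=True)

# Approach 2: keep only the "day month year" part
# (expand=False is an addition: it returns a Series rather than a DataFrame)
extracted = df["date"].str.extract(r"(\d+ [a-zA-Z]+ \d{4})", expand=False)

print(replaced.tolist())
print(extracted.tolist())
```

Note that the two approaches differ on rows that do not match: str.replace leaves such a row unchanged, while str.extract yields NaN for it.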
A variant of your original approach is to split on a regex lookbehind, but it is less efficient since it needs 2 str operations:
df['date'] = df['date'].str.split(r'(?<=2023)', regex=True).str.get(0)
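To see why .str.get(0) is needed after the split, the sketch below applies the same lookbehind with plain re.split on the sample strings. This assumes Python 3.7 or later, where re.split can split on a zero-width match. It illustrates the pattern only and is not a claim about how pandas implements str.split internally.

```python
import re

# Sample strings taken from the question
samples = [
    "21 July 2023 (abcd)",
    "22 July 2023 00:00:01",
    "23 July 2023 -abcda",
]

# Split at the zero-width position right after "2023";
# the year stays attached to the left-hand piece
parts = [re.split(r"(?<=2023)", s) for s in samples]

# The first piece of each split is the cleaned date
first = [p[0] for p in parts]

print(parts[0])
print(first)
```

The lookbehind matches a position rather than any characters, so nothing is consumed by the split and the year is kept, unlike splitting on the literal '2023'.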