DataFrame: remove everything after the year but keep the year

My DataFrame looks like this:

id date
1  21 July 2023 (abcd)
2  22 July 2023 00:00:01
3  23 July 2023 -abcda

I need to remove everything after the year (2023) but keep the year itself. So the result should be:

id date
1  21 July 2023
2  22 July 2023
3  23 July 2023

I tried this, but it also drops the year:

df['date'].str.rsplit('2023', n=1).str.get(0)

I can't simply append '2023' to the string that is left after this operation, because the year can change. I could deal with that, but I just need to get the result.

Regards Tomasz

Solution

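You can use str.replace with the following regex to remove everything after the 4 digits of the year. The lookbehind (?<=\b\d{4}\b) makes the match start right after a standalone 4-digit number, so the year itself is kept: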

df['date'] = df['date'].str.replace(r'(?<=\b\d{4}\b).*', '', regex=True)
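
As a self-contained sketch of the str.replace approach on the sample data from the question (the name year_tail is only illustrative, and the pattern is precompiled here purely for readability):

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["21 July 2023 (abcd)", "22 July 2023 00:00:01", "23 July 2023 -abcda"],
})

# Match everything that follows a standalone 4-digit number.
# The lookbehind itself consumes nothing, so the year stays in the string.
year_tail = re.compile(r"(?<=\b\d{4}\b).*")

df["date"] = df["date"].str.replace(year_tail, "", regex=True)
print(df)
```

A row that contains no 4-digit number is left unchanged by this replacement.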

Or use str.extract to capture the day (one or more digits), the month name (letters), and the 4-digit year:

df['date'] = df['date'].str.extract(r'(\d+ [a-zA-Z]+ \d{4})')
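
A minimal sketch of how str.extract behaves, assuming a hypothetical extra row that contains no date at all. Passing expand=False is a choice made here so that a Series comes back instead of a one-column DataFrame:

```python
import pandas as pd

# Two rows from the question plus one hypothetical row without a date.
dates = pd.Series(["21 July 2023 (abcd)", "22 July 2023 00:00:01", "no date here"])

# With a single capture group, expand=False returns a Series.
extracted = dates.str.extract(r"(\d+ [a-zA-Z]+ \d{4})", expand=False)

# Rows that do not match become missing values instead of staying unchanged.
print(extracted.tolist())
```

This is the main practical difference from str.replace: non-matching rows turn into missing values, which can be useful for spotting malformed input.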

NB. If you only want to cut after 2023 and not after any 4-digit year, replace \d{4} with 2023.
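
To illustrate the difference between the two patterns, here is a small sketch with the standard re module. The second input string and all variable names are made up for the example:

```python
import re

# Hypothetical inputs: one row from 2023 and one from another year.
samples = ["21 July 2023 (abcd)", "5 May 2024 -xyz"]

any_year = re.compile(r"(?<=\b\d{4}\b).*")   # any standalone 4-digit number
only_2023 = re.compile(r"(?<=\b2023\b).*")   # the literal year 2023 only

generic = [any_year.sub("", s) for s in samples]
fixed = [only_2023.sub("", s) for s in samples]

print(generic)  # both rows are trimmed after their year
print(fixed)    # only the 2023 row is trimmed
```

Note that \d{4} matches the first standalone 4-digit number in the string, so it assumes no other 4-digit number appears before the year.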

Output:

   id          date
0   1  21 July 2023
1   2  22 July 2023
2   3  23 July 2023

A variant of your original approach would be to split on a regex lookbehind (the regex argument of str.split needs pandas 1.4 or later), but it's less efficient since you need two str operations:

df['date'] = df['date'].str.split(r'(?<=2023)', regex=True).str.get(0)
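
The splitting step can be sketched on a single string with the standard re module, assuming Python 3.7 or later (where re.split can split on zero-width matches). The variable names are illustrative:

```python
import re

text = "22 July 2023 00:00:01"

# The lookbehind matches the empty position right after "2023",
# so the string is cut there and the year stays in the first piece.
parts = re.split(r"(?<=2023)", text)
head = parts[0]

print(parts)
print(head)
```

The first element of the split is the part to keep, which is why the pandas version needs the extra .str.get(0) step.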