python pandas dataframe date split

DataFrame: remove everything after the year but keep the year


My DF looks like below:

id date
1  21 July 2023 (abcd)
2  22 July 2023 00:00:01
3  23 July 2023 -abcda

I need to remove everything after the year (2023) while keeping the year itself. So the result should be:

id date
1  21 July 2023
2  22 July 2023
3  23 July 2023

I tried this, but it drops the year as well:

df['date'].str.rsplit('2023', n=1).str.get(0)

I can't simply append '2023' to the string that is left after this operation, because the year can change. I could work around that, but I just need to get the result shown above.

Regards Tomasz


Solution

  • You can use the following regex with str.replace to remove everything after the 4 digits of the year:

    df['date'] = df['date'].str.replace(r'(?<=\b\d{4}\b).*', '', regex=True)
    

    Or use str.extract to capture the day digits, the month name, and the 4-digit year:

    df['date'] = df['date'].str.extract(r'(\d+ [a-zA-Z]+ \d{4})', expand=False)
    

    NB. If you only want to cut after 2023 rather than after any 4-digit year, replace \d{4} with 2023.

    Output:

       id          date
    0   1  21 July 2023
    1   2  22 July 2023
    2   3  23 July 2023
    

    A variant of your original approach would be to split with a regex lookbehind, but it's less efficient since it needs two chained str operations (split, then get):

    df['date'] = df['date'].str.split(r'(?<=2023)', regex=True).str.get(0)
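
For anyone who wants to try the three approaches side by side, here is a minimal, self-contained sketch. The first three rows come from the question; the fourth row with a different year is a made-up addition to check that \d{4} is not tied to 2023. The column is built with object dtype on the assumption that this keeps the patterns on Python's re engine whichever string backend pandas uses, since lookbehind is not supported by every regex engine. The regex=True argument of str.split needs a reasonably recent pandas (1.4 or later, as far as I know).

```python
import pandas as pd

# Rows 1-3 are from the question; row 4 is a hypothetical extra row
# with a different year, added only to exercise the generic \d{4} pattern.
dates = pd.Series(
    [
        "21 July 2023 (abcd)",
        "22 July 2023 00:00:01",
        "23 July 2023 -abcda",
        "5 March 2024 extra text",
    ],
    dtype=object,  # assumption: plain Python strings, so Python's `re` is used
)
df = pd.DataFrame({"id": [1, 2, 3, 4], "date": dates})

# 1) Delete everything that follows a standalone 4-digit number.
replaced = df["date"].str.replace(r"(?<=\b\d{4}\b).*", "", regex=True)

# 2) Capture "<day> <month name> <4-digit year>"; expand=False returns a Series.
extracted = df["date"].str.extract(r"(\d+ [a-zA-Z]+ \d{4})", expand=False)

# 3) Split once right after the year and keep the left-hand part.
split_first = (
    df["date"].str.split(r"(?<=\b\d{4}\b)", n=1, regex=True).str.get(0)
)

print(pd.DataFrame({"replace": replaced, "extract": extracted, "split": split_first}))
```

One difference worth keeping in mind: when a row has no match, str.replace leaves the string unchanged, whereas str.extract returns NaN for that row.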