Search code examples
pythonpandasdataframereplacedata-cleaning

Data Frame- remove special character


My DF looks like below:

id date
1  '  : 07/01/2020 23:25'
2  ': 07/02/2020'
3  ' 07/03/2020 23:25 1'
4  '07/04/2020'
5  '23:50 07/05/2020'
6  '07 06 2023'
7  '00:00 07 07 2023'

I need to remove all special characters and numbers that are between ':' so DF should looks like below

id date
1 07/01/2020 
2 07/02/2020 
3 07/03/2020 
4 07/04/2020 
5 07/05/2020 

I can't use a simple: df['date'].str.split(':').str[0] or df['date'].str.replace(":",'') beacuse I will lose correct values. Do you have idea how I could solve this?

Regards


Solution

  • Assuming you want to convert to datetime, you don't necessarily have to cleanup the string, you can pass exact=False to to_datetime:

    df['out'] = pd.to_datetime(df['date'], format='%d/%m/%Y', exact=False)
    

    Otherwise, for strings, use the (\d{2}/\d{2}/\d{4}) regex and str.extract

    df['clean'] = df['date'].str.extract(r'(\d{2}\/\d{2}\/\d{4})')
    

    Output:

       id                  date        out       clean
    0   1    : 07/01/2020 23:25 2020-01-07  07/01/2020
    1   2          : 07/02/2020 2020-02-07  07/02/2020
    2   3    07/03/2020 23:25 1 2020-03-07  07/03/2020
    3   4            07/04/2020 2020-04-07  07/04/2020
    4   5      23:50 07/05/2020 2020-05-07  07/05/2020
    
    updated example:

    Extract with space or / as separator, then replace the spaces by /.

    df['clean'] = (df['date']
                   .str.extract(r'(\d{2}[ /]\d{2}[ /]\d{4})',
                                expand=False)
                   .str.replace(' ', '/')
                  )
    

    Output:

       id                  date       clean
    0   1    : 07/01/2020 23:25  07/01/2020
    1   2          : 07/02/2020  07/02/2020
    2   3    07/03/2020 23:25 1  07/03/2020
    3   4            07/04/2020  07/04/2020
    4   5      23:50 07/05/2020  07/05/2020
    5   6            07 06 2023  07/06/2023
    6   7      00:00 07 07 2023  07/07/2023