
Keep Only Date Values When a Pandas DataFrame Column Includes Newlines and Non-Word Characters


I have imported the sample data set below into a pandas DataFrame.
My goal is to produce the output shown under "wants" from the data under "have".
In other words, I want to keep only the date from values that are mixed with a time component and some stray non-word characters.
I tried datetime.date, but it does not seem to work because it only applies to datetime values.
I have also tried several regexes, but they remove the hyphens as well and leave a space between the year, month, and day.
What would be a good way to address this? Any help would be appreciated.

have

ID         Date_time  
210        01-01-2016\r\n01:07 PM       
205        01-06-2017\r\n01:10 PM            
...           ...                
1504       â€Ž30-10-2014\r\n01:15 AM      
1544       â€Ž11-10-2018\r\n05:38 AM          

wants

ID         Date_time  
210        2016-01-01      
205        2017-06-01            
...           ...                
1504       2014-10-30
1544       2018-10-11


Solution

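  • Use str.extract to pull out the dd-mm-yyyy substring, which skips the stray leading characters and the time after the line break, then parse it with pandas.to_datetime: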

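    # Capture the dd-mm-yyyy part (raw string so \d is not read as a string escape),
    # then parse it as day-month-year.
    df['Date_time'] = pd.to_datetime(df['Date_time'].str.extract(r'(\d{2}-\d{2}-\d{4})',
                                     expand=False), format='%d-%m-%Y')
    print(df)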
    
    # Output
         ID  Date_time
    0   210 2016-01-01
    1   205 2017-06-01
    2  1504 2014-10-30
    3  1544 2018-10-11
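
For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same approach. The sample rows are rebuilt from the question, and "\u200e" (an invisible left-to-right mark) is only an assumed stand-in for the stray characters in front of some dates; the real data may contain something else. The `extracted` and `dates_only` names and the `errors="coerce"` argument are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Sample rows rebuilt from the question. "\u200e" is an assumed stand-in
# for the stray leading characters; the real data may differ.
df = pd.DataFrame({
    "ID": [210, 205, 1504, 1544],
    "Date_time": [
        "01-01-2016\r\n01:07 PM",
        "01-06-2017\r\n01:10 PM",
        "\u200e30-10-2014\r\n01:15 AM",
        "\u200e11-10-2018\r\n05:38 AM",
    ],
})

# Take the first dd-mm-yyyy match in each cell; cells without a match become NaN.
extracted = df["Date_time"].str.extract(r"(\d{2}-\d{2}-\d{4})", expand=False)

# Parse as day-month-year. errors="coerce" turns missing or unparseable
# values into NaT instead of raising.
df["Date_time"] = pd.to_datetime(extracted, format="%d-%m-%Y", errors="coerce")

# Optional: plain datetime.date objects instead of datetime64 values.
dates_only = df["Date_time"].dt.date

print(df)
print("rows without a date:", df["Date_time"].isna().sum())
```

Because the pattern is not anchored to the start of the string, whatever precedes the date is simply ignored, so there is no need to clean the stray characters first. Counting the NaT values afterwards is a quick way to spot rows where no dd-mm-yyyy date was found.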