Search code examples
pythonpandasdatetime-seriesdata-cleaning

pd.to_datetime converting mixed object feature column value to NAT values but intend to extract month


I've been trying to process a column date in the Dataframe to obtain the month in int type using pd.to_datetime.

This is the code in Python using pandas .

print(df["date"].head())
0       Oct
1       Jun
2    15-Oct
3    27-Nov
4    26-Sep
Name: date, dtype: object

After attempting to convert to datetime,I obtained all values in NAT. How do I fix this?

df["date"]=pd.to_datetime(df["date"],errors='coerce')
print(df["date"].head())

I get:

0   NaT
1   NaT
2   NaT
3   NaT
4   NaT

Name: date, dtype: datetime64[ns]

Running isNA returns all values at NAT:

print(df["date"].isna().sum())
1000

I plan to obtain:

0       10
1       06
2       10
3       11
4       09

For values that can't be converted to datetime and then int(because values are missing or unrecognisable) I plan to replace with "Date not given"

What do I need to do?


Solution

  • Use Series.str.extract with Series.map:

    d = {'Jan':'01', 'Feb':'02','Mar':'03', 'Apr':'04', 
         'May':'05','Jun':'06', 'Jul':'07','Aug':'08',
         'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
    
    df["date1"] = df["date"].str.extract(r'([A-Za-z]+)', expand=False).map(d)
    

    Or convert values to datetimes with %b for match months and convert to strings by Series.dt.strftime:

    df["date2"] = pd.to_datetime(df["date"].str.extract(r'([A-Za-z]+)', expand=False), 
                                 format='%b', errors='coerce').dt.strftime('%m')
    print (df)
         date date1 date2
    0     Oct    10    10
    1     Jun    06    06
    2  15-Oct    10    10
    3  27-Nov    11    11
    4  26-Sep    09    09
    

    If need integers:

    print (df)
         date
    0    Ocyt
    1     Jun
    2  15-Oct
    3  27-Nov
    4  26-Sep
    
    
    df["date2"] = (pd.to_datetime(df["date"].str.extract(r'([A-Za-z]+)', expand=False), 
                                  format='%b', errors='coerce')
                     .dt.month.astype('Int64'))
    print (df)
         date  date2
    0    Ocyt   <NA>
    1     Jun      6
    2  15-Oct     10
    3  27-Nov     11
    4  26-Sep      9