Search code examples
pythonpandasstring-to-datetime

Pandas Converting date string (only month and year) to datetime


I am trying to convert a datetime object to datetime. In the original dataframe the data type is a string and the dataset has shape = (28000000, 26). Importantly, the format of the date is MMYYYY only. Here's a data sample:

                       DATE
Out[3]    0           081972
          1           051967
          2           101964
          3           041975
          4           071976

I tried:

df['DATE'].apply(pd.to_datetime(format='%m%Y'))

and

pd.to_datetime(df['DATE'],format='%m%Y')

I got Runtime Error both times

Then

df['DATE'].apply(pd.to_datetime)

it worked for the other not shown columns(with DDMMYYYY format), but generated future dates with df['DATE'] because it reads the dates as MMDDYY instead of MMYYYY.

            DATE
0       1972-08-19
1       2067-05-19
2       2064-10-19
3       1975-04-19
4       1976-07-19

Expect output:

          DATE
0       1972-08
1       1967-05
2       1964-10
3       1975-04
4       1976-07

If this question is a duplicate please direct me to the original one, I wasn't able to find any suitable answer.

Thank you all in advance for your help


Solution

  • First if error is raised obviously some datetimes not match, you can test it by errors='coerce' parameter and Series.isna, because for not matched values are returned missing values:

    print (df)
         DATE
    0   81972
    1   51967
    2  101964
    3   41975
    4  171976 <-changed data
    
    
    print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
    0   1972-08-01
    1   1967-05-01
    2   1964-10-01
    3   1975-04-01
    4          NaT
    Name: DATE, dtype: datetime64[ns]
    
    print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
         DATE
    4  171976
    

    Solution with output from changed data with converting to datetimes and the to months periods by Series.dt.to_period:

    df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
    print (df)
          DATE
    0  1972-08
    1  1967-05
    2  1964-10
    3  1975-04
    4      NaT
    

    Solution with original data:

    df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
    print (df)
    
    0  1972-08
    1  1967-05
    2  1964-10
    3  1975-04
    4  1976-07