python pandas time-series string-to-datetime

Strange behavior from to_datetime()


I have really been having a tough time here.

My DataFrame looks like this:

              Purchase_Date  Customer_ID Gender
    0   2012-12-18 00:00:00         7223      F
    1   2012-12-20 00:00:00         7841      M
    2   2012-12-21 00:00:00         8374      F

My goal is to convert the "Purchase_Date" column from strings to datetime objects so that I can run a cohort analysis by applying this function to it:

      def get_month(x): return dt.datetime(x.year, x.month, 1)
      data['InvoiceMonth'] = data['Purchase_Date'].apply(get_month)
      grouping = data.groupby('Customer_ID')['InvoiceMonth']
      data['CohortMonth'] = grouping.transform('min')

The function returns the error: 'str' object has no attribute 'year'. I have tried the following functions and played with all the arguments (dayfirst, yearfirst, ...):

      data["Purchase_Date"] = pd.to_datetime(data["Purchase_Date"])
      pd.to_datetime()
      datetime.datetime.strptime()

I keep getting ValueError: day is out of range for month

Please help out


Solution

  • So, you were almost there:

    data["Purchase_Date"] = pd.to_datetime(data["Purchase_Date"])
    data['InvoiceMonth'] = data["Purchase_Date"].dt.strftime("%Y-%m-01")
    

    (This outputs the month as strings, i.e. object dtype. You can convert it to datetime by wrapping the expression in pd.to_datetime(...).)

    Or alternatively - using your approach:

    data["Purchase_Date"] = pd.to_datetime(data["Purchase_Date"])
    
    import datetime as dt
    
    def get_month(x): return dt.datetime(x.year, x.month, 1)
    
    data['InvoiceMonth'] = data["Purchase_Date"].apply(get_month)
    

    (This outputs the month as datetime.)

    Both return the following, although I would highly recommend the first option:

      Purchase_Date  Customer_ID Gender InvoiceMonth
    0    2012-12-18         7223      F   2012-12-01
    1    2012-12-20         7841      M   2012-12-01
    2    2012-12-21         8374      F   2012-12-01
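
If pd.to_datetime() still raises "ValueError: day is out of range for month", the column probably contains at least one value that is not a real calendar date. Here is a minimal sketch of one way to find such rows and then build the cohort columns. The sample data (including the deliberately invalid "2012-02-30" row), the format string, and the variable names parsed and bad_rows are assumptions for illustration, not taken from your actual data:

```python
import pandas as pd

# Hypothetical sample data; the last row is an impossible date on purpose.
data = pd.DataFrame({
    "Purchase_Date": ["2012-12-18 00:00:00", "2012-12-20 00:00:00",
                      "2012-12-21 00:00:00", "2012-02-30 00:00:00"],
    "Customer_ID": [7223, 7841, 7223, 8374],
    "Gender": ["F", "M", "F", "F"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
# The format string is an assumption based on the sample shown in the question.
parsed = pd.to_datetime(data["Purchase_Date"],
                        format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that had a value but failed to parse are the likely culprits.
bad_rows = data[parsed.isna() & data["Purchase_Date"].notna()]
print(bad_rows)

# Keep only the rows that parsed, then derive the month columns.
data["Purchase_Date"] = parsed
data = data.dropna(subset=["Purchase_Date"]).copy()

# First day of each purchase month, kept as datetime64 rather than strings.
data["InvoiceMonth"] = data["Purchase_Date"].dt.to_period("M").dt.to_timestamp()

# Earliest purchase month per customer, as in the question's cohort step.
data["CohortMonth"] = data.groupby("Customer_ID")["InvoiceMonth"].transform("min")
print(data)
```

Whether to drop, fix, or investigate the rows in bad_rows depends on your data. The to_period("M").dt.to_timestamp() step is just another way to get the first of the month as datetime without going through strings.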