Search code examples
pythondatetimepandasdataframesklearn-pandas

Converting dataframe column of years to month day year


I'm doing this for homework.

My goal is to have an entirely new column with just the days elapsed. There are 500,000+ rows of this...so my goal is to:

  1. In the Pandas dataframe, I have these two date columns which are in different formats. I'd like to subtract these two columns, and then create a new 'Days Elapsed' column which is a simple integer list.
  2. Output into new CSV (this code is done)
  3. Now I can completely avoid parsing dates every time I rework the code/read the CSV because it's taking a crazy long time and slowing my work down.

I'm trying to convert this:

   Yearmade         Saledate
0      2004  11/16/2006 0:00
1      1996   3/26/2004 0:00
2      2001   2/26/2004 0:00

Into:

       Days Elapsed
0      1050
1      3007
2      1151

Current attempt:

year_mean = df[df['YearMade'] > 1000]['YearMade'].mean()
df.loc[df['YearMade'] == 1000, 'YearMade'] = year_mean
## There's lots of erroneous data of the year 1000, so replacing all of them with the mean of the column (mean of column without error data, that is)
df['Yearmade'] = "1/1/"+df['YearMade'].astype(str)
## This is where it errors out.
df['Yearmade'] = pd.to_datetime(df['Yearmade'])
df['Saledate'] = pd.to_datetime(df['Saledate'])
df['Age_at_Sale'] = df['Saledate'].sub(df['Yearmade'])
df = df.drop(['Saledate', 'Yearmade'], axis=1)

[then there's another class method to convert the current df into csv]

Solution

  • assuming you have the following DF:

    In [203]: df
    Out[203]:
       Yearmade   Saledate
    0      2004 2006-11-16
    1      1996 2004-03-26
    2      2001 2004-02-26
    3      1000 2003-12-23     # <--- erroneous year 
    

    Solution:

    In [204]: df.loc[df.Yearmade <= 1900, 'Yearmade'] = round(df.Yearmade.loc[df.Yearmade > 1900].mean())
    
    In [205]: df
    Out[205]:
       Yearmade   Saledate
    0      2004 2006-11-16
    1      1996 2004-03-26
    2      2001 2004-02-26
    3      2000 2003-12-23    # <--- replaced with avg. year 
    
    In [206]: df['days'] = (pd.to_datetime(Saledate) - pd.to_datetime(df.Yearmade, format='%Y')).dt.days
    
    In [207]: df
    Out[207]:
       Yearmade   Saledate  days
    0      2004 2006-11-16  1050
    1      1996 2004-03-26  3007
    2      2001 2004-02-26  1151
    3      2000 2003-12-23  1452