python pandas apply data-cleaning

Pass row numbers while using apply() function in a column in Pandas


I am trying to format a date column. The dates are given as 24th Mar, 5th Jul and so on. I wrote a function that splits these and converts them to 24/03 and 05/07. The problem is that rows 0 to 8 of my pandas DataFrame are for 2021 and the rest of the rows are for 2020. With the current code I can turn 24th Mar into 24/03, but I want it to be 24/03/2021 if the row number is between 0 and 8, and 24/03/2020 if the row number is after 8.

import operator

def dateConversion(date):
  day =''
  month = ''
  val_month = 0
  if operator.contains(date, "th"): 
    day, month = date.split("th")
  if operator.contains(date, "rd"): 
    day, month = date.split("rd")
  if operator.contains(date, "nd"):
    day, month = date.split("nd")
  if operator.contains(date, "st"):
    day, month = date.split("st")
  
  day = day.strip()

  if(int(day) < 10):
    day = str(day)
    day = '0' + day

  month = month.strip()
  months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] #list items based on data
  if month in months:
    val_month = months.index(month) + 1 #getting month value
  if(val_month < 10):
    val_month = str(val_month)
    val_month = '0' + val_month
  day = day + '/' + str(val_month) #+ '/' + year
  return day

I apply it to the column with the code below:

df_ipo['Listed Date_'] = df_ipo['Listed Date'].apply(lambda x: dateConversion(x))

How can I also pass the row number to dateConversion through this apply call, so that I can set the year accordingly?


Solution

  • Instead of parsing the dates row by row (which can be slow), we can strip the ordinal suffixes (st, nd, rd, th) and append the year to each row. This makes it easy to convert the column to a datetime dtype, which can then be formatted into your strings with strftime.

    I left all 4 columns (the original, with the abbreviations stripped and years added, the datetime and the string formatted date) for clarity, but there's no real reason to create all of these separate columns.

    Sample Data

    import pandas as pd
    df = pd.DataFrame({'date': ['24th Mar', '5th Jul', '1st May', '2nd Jun', '3rd Jul',
                                '30th May', '21st Oct', '18th Dec', '5th Sep', '16th Dec']})
    

    # Strip the ordinal suffix (st, nd, rd, th) that follows the day number;
    # a single pattern also covers 11th, 12th and 13th
    df['date_strip'] = df['date'].str.replace(r'(\d+)(?:st|nd|rd|th)', r'\1', regex=True)
    
    
    # Add 2021 to the first 8 rows (0 to 7) and 2020 to the rest;
    # change 8 to 9 if row 8 should also be 2021
    df['date_strip'] = df['date_strip'] + ' ' + (['2021']*8 + ['2020']*(len(df)-8))
    
    df['datetime'] = pd.to_datetime(df['date_strip'], format='%d %b %Y')
    # %Y gives the four-digit year asked for in the question (24/03/2021)
    df['date_fmt'] = df['datetime'].dt.strftime('%d/%m/%Y')
    

           date   date_strip   datetime    date_fmt
    0  24th Mar  24 Mar 2021 2021-03-24  24/03/2021
    1   5th Jul   5 Jul 2021 2021-07-05  05/07/2021
    2   1st May   1 May 2021 2021-05-01  01/05/2021
    3   2nd Jun   2 Jun 2021 2021-06-02  02/06/2021
    4   3rd Jul   3 Jul 2021 2021-07-03  03/07/2021
    5  30th May  30 May 2021 2021-05-30  30/05/2021
    6  21st Oct  21 Oct 2021 2021-10-21  21/10/2021
    7  18th Dec  18 Dec 2021 2021-12-18  18/12/2021
    8   5th Sep   5 Sep 2020 2020-09-05  05/09/2020
    9  16th Dec  16 Dec 2020 2020-12-16  16/12/2020
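
If you do want to keep a row-wise function and pass the row number to it, as the question literally asks, one option is to call apply on the DataFrame with axis=1 and read row.name. This is only a minimal sketch: date_conversion, MONTHS and last_2021_row are illustrative names, not part of the original code, and row.name equals the row number only when the frame has the default 0..n-1 index. Following the question's wording, rows 0 through 8 get 2021 here.

```python
import re

import pandas as pd

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def date_conversion(date, row_number, last_2021_row=8):
    """Turn e.g. '24th Mar' into '24/03/2021', choosing the year by row number.

    last_2021_row is a hypothetical parameter: rows up to and including it
    are treated as 2021, later rows as 2020.
    """
    match = re.fullmatch(r'(\d{1,2})(?:st|nd|rd|th)\s+(\w{3})', date.strip())
    if match is None:
        raise ValueError(f'Unrecognised date: {date!r}')
    day = int(match.group(1))
    month = MONTHS.index(match.group(2)) + 1  # raises ValueError for unknown months
    year = 2021 if row_number <= last_2021_row else 2020
    return f'{day:02d}/{month:02d}/{year}'


# Same sample values as above, under the question's column name
df_ipo = pd.DataFrame({'Listed Date': ['24th Mar', '5th Jul', '1st May', '2nd Jun',
                                       '3rd Jul', '30th May', '21st Oct', '18th Dec',
                                       '5th Sep', '16th Dec']})

# With axis=1 each row is passed as a Series; row.name is its index label,
# which matches the row number only for the default RangeIndex
df_ipo['Listed Date_'] = df_ipo.apply(
    lambda row: date_conversion(row['Listed Date'], row.name), axis=1)
```

If the index is not the default one, calling df_ipo.reset_index(drop=True) first, or looping with enumerate over the column, gives positional row numbers instead. For large frames the vectorised approach above is likely to be faster.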