Search code examples
pythonexcelpandasdatetext-parsing

Data parsing in pandas, python


I have an excel file with many columns, one of them, 'Column3' is date with some text in it, basically it looks like that:

26/05/20
XXX
YYY
12/05/2020

The data is written in DD/MM/YY format but pandas, just like excel, thinks that 12/05/2020 it's 05 Dec 2020 while it is 12 May 2020. (My windows is set to american date format)

Important note: when I open stock excel file, cells with 12/05/2020 already are Date type, trying to convert it to text it gives me 44170 which will give me wrong date if I just reformat it into DD/MM/YY

I added this line of code:

iport pandas as pd
    
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
df = pd.read_excel("my_file.xlsx", parse_dates=['Column3'], date_parser=dateparse)

But the text in the column generates an error.

ValueError: time data 'XXX' does not match format '%d/%m/%y'

I went a step further and manually removed all text (obviously I can't do it all the time) to see whether it works or nor, but then I got following error

dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
TypeError: strptime() argument 1 must be str, not datetime.datetime

I also tried this:

df['Column3'] = pd.to_datetime(df.Column3, format ='%d/%m/%y', errors="coerce") 
# if I make errors="ignore" it doesn't change anything.

in that case my 26/05/20 was correctly converted to 26 May 2020 but I lost all my text data(it's ok) and other dates which didn't match with my format argument. Because previously they were recognized as American type date.

My objective is to convert the data in Column3 to the same format so I could apply filters with pandas. I think it's couple solutions:

  1. tell Pandas to not convert text to date at all (but it is already saved as Date type in stock file, will it work?)
  2. somehow ignore text values and use date_parser= method co convert add dates to DD/MM/YY
  3. with help of pd.to_datetime convert 26/05/20 to 26 May 2020 and than convert 2020-09-06 00:00:00 to 9 June 2020 (seems to be the simplest one but ignore argument doesn't work.)

Here's link to small sample file https://easyupload.io/ca5p6w


Solution

  • Posting this as an answer, since it's too long for a comment

    The problem originates in Excel. If I open it in Excel, I see 2 strings that look like dates 26/05/20, 05/12/2020 and 06/02/2020. Note the difference between the 20 and 2020 On lines 24 and 48 I see dates in Column4. This seems to indicate the Excel is put together. Is this Excel assembled by copy-paste, or programmatically?

    loading it with just pd.read_excel gives these results for the dates:

    • 26/05/20
    • 2020-12-05 00:00:00
    • 2020-02-06 00:00:00

    If I do df["Column3"].apply(type) gives me

    • str
    • <class 'datetime.datetime'>
    • <class 'datetime.datetime'>

    So in the Excel file these are marked as datetime.

    Loading them with df = pd.read_excel(DATA_DIR / "sample.xlsx", dtype={"Column3": str}) changes the type of all to str, but does not change the output.

    If you open the extract the file, and go look at the xml file xl\worksheets\sheet1.xml directly and look for cell C26, you see it as 44170, while C5 is 6, which is a reference to 26/05/20 in xl/sharedStrings.xml

    How do you 'make' this Excel file? This can best be solved in how this file is put together.


    Workaround

    As a workaround, you can convert the dates piecemeal. The different format allows this:

    format1 = "%d/%m/%y"
    format2 = "%Y-%d-%m %H:%M:%S"
    

    Then you can do pd.to_datetime(dates, format=format1, errors="coerce") to only get the first dates, and NaT for the ones not according to the format. Then you use combine_first to fill the voids.

    dates = df["Column3"]  # of the one imported with dtype={"Column3": str}
    dates_parsed = (
        pd.to_datetime(dates, format=format1, errors="coerce")
        .combine_first(pd.to_datetime(dates, format=format2, errors="coerce"))
        .astype(object)
        .combine_first(dates)
    )
    

    The astype(object) is needed to fill in the empty places with the string values.