Search code examples
pythonsql-serverexcelpandas

Import Excel to Pandas Dataframe, but preserve order of Day/Month in dates?


Data engineer here, just loading Excel xlsx files to a SQL Server database table. The problem is, the person creating the Excel files enters dates as dd/mm/yyyy (European format). If the day is > 12, the resultant xlsx reads the date as Text (eg. 31/03/2024). But if the day is 12 or less, the xlsx reads the date as a Date (eg. 06/04/2024 looks like June 4, but it's really April 6). It seems Pandas respects these datatype choices when loading the file into a DataFrame with pd.read_excel(). This means the above two dates end up in the database table as "2024-03-31" (correct) and "2024-06-04" (wrong).

How do I get these Excel files into the table, but correctly parse the date every time? I don't have control over the Excel files before they're saved and dropped into the target file folder.

Working Solution

Thanks to @Temunel, whose solution almost works. I modified it with an idea from here: Is there a Python equivalent to C#'s DateTime.TryParse()?

The real problem is, if Excel saves as a date (i.e. the first value is < 13), then Pandas reads it as a date, then converts it to a string, not an Excel date integer - it ends up as 'yyyy-mm-dd HH:MM:SS'. If it doesn't recognize the date (i.e. first value > 12), Pandas just reads it as a string - it ends up as 'dd/mm/yyyy'.

from datetime import datetime, timedelta

DATE_PATTERNS = ["%Y-%d-%m %H:%M:%S", "%d/%m/%Y"]

def excel_date_parser(val):
    """ Parse Excel serial dates and text dates in dd/mm/yyyy format. """
    # Check if it's a float or an int (Excel serial date)
    if isinstance(val, float) or isinstance(val, int):
        # Convert Excel date serial to datetime
        return datetime(1899, 12, 30) + timedelta(days=val)
    else:
        # Try parsing as text in one of the formats from DATE_PATTERNS list above
        try:
            return try_to_datetime(val)
        except Exception as e:
            print(f"Error parsing {val}: {e}")
            return None

def try_to_datetime(val):
    for fmt in DATE_PATTERNS:
        try:
            return pd.to_datetime(val, format=fmt, errors='raise')
        except:
            continue
    raise ValueError

df = pd.read_excel("file_with_Euro_dates.xlsx", dtype={"Date": str})
df["Date"] = df["Date"].apply(excel_date_parser)
df

Solution

  • You can use pd.to_datetime() to manually convert the columns to a consistent date format after reading the file.

    Solution

    1. Load the data.
    2. Separate the integer serial dates and the text dates.
    3. Convert the Excel serial numbers to dates.
    4. Parse the text dates using a specific format.

    Here's the code.

    import pandas as pd
    from datetime import datetime, timedelta
    
    # Function to parse Excel dates
    def excel_date_parser(val):
        """ Parse Excel serial dates and text dates in dd/mm/yyyy format. """
        # Check if it's a float or an int (Excel serial date)
        if isinstance(val, float) or isinstance(val, int):
            # Convert Excel date serial to datetime
            return datetime(1899, 12, 30) + timedelta(days=val)
        else:
            # Try parsing as text in the format 'dd/mm/yyyy'
            try:
                return pd.to_datetime(val, format='%d/%m/%Y', errors='raise')
            except Exception as e:
                print(f"Error parsing {val}: {e}")
                return None
    
    # Read the Excel file
    df = pd.read_excel('your_file.xlsx', dtype=object)
    
    # Convert the 'Date' column to datetime
    df['Date'] = df['Date'].apply(excel_date_parser)
    
    print(df)
    

    I hope this will help you a little.