I am very new to coding (this is the first code I am writing).
I have multiple csv files, all with the same headers. The files contain hourly ozone concentrations for every day of the year, and each file is a separate year (ranging from 2009 to 2020). I have a column 'date' containing the year-month-day, and a column for the hour of the day (0-23). I want to separate the year from the month-day, combine the hour with the month-day, make that combination the index, and then merge the other csv files into one dataframe.
In addition, I need to average the values for each day at each hour across all the years; however, three of my files include a leap day (an extra 24 values). I would appreciate any advice on how to account for the leap years. I assume I would need to add the leap day to the files that lack it as null values and then drop the nulls, but that seems circular.
Also, if you have any tips on how to simplify my process, feel free to share!
Thanks in advance for your help.
Update: I tried the advice from Rookie below, but after importing csv data, I get an error message:
import pandas as pd
import os

path = "C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP"

df = pd.DataFrame()
for file in os.listdir(path):
    df_temp = pd.read_csv(os.path.join(path, file))
    df = pd.concat((df, df_temp), axis=0)
First, I get an error message that says OSError: Initializing from file failed. I tried to fix the issue by adding engine='python' based on advice from "OSError: Initializing from file failed on csv in Pandas", but now I'm getting PermissionError: [Errno 13] Permission denied: 'C:/Users/heath/Documents/ARB project Spring2020/ozone/SJV/SKNP\\.ipynb_checkpoints'.
Please help, I'm not sure what else to do. I edited the permissions so that everyone has read & write access, but I still get the "permission denied" error when importing the csv files on Windows.
First, you want to identify what type of column you are dealing with once it is in a pandas DataFrame. This can be accomplished with the dtypes attribute. For example, if your DataFrame is df, you can do df.dtypes, which will tell you what the column types are. If you see an object type, pandas is interpreting the column as strings (sequences of characters, not actual date or time values). If you see datetime64[ns], pandas knows the column holds datetime values (date and time combined). If you see timedelta64[ns], pandas knows it is a time difference (more on this later).
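For example, with a small made-up sample (the column names here are assumptions, not the asker's actual headers):

```python
import pandas as pd

# Hypothetical two-row sample mimicking one of the ozone files
df = pd.DataFrame({"Date": ["2020-06-04", "2020-06-05"], "Ozone": [0.031, 0.028]})
print(df.dtypes)   # "Date" shows as object (a plain string), "Ozone" as float64

df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)   # "Date" is now datetime64[ns]
```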
If the dtype is object, let's convert the column to the datetime64[ns] type so pandas knows we are dealing with date/time values. This can be done by simple reassignment. For example, if the format of the date is YYYY-mm-dd (2020-06-04), we can convert the date column using the following method (assuming the name of your date column is "Date"). Please reference the strftime directives for different formats.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
The time column is slightly trickier. Pandas has no dtype for a time of day on its own, so we need to convert the time to a timedelta64[ns]. If the time format is hh:mm:ss (i.e. "21:02:24"), we can use the following method to convert the object type.
df["Time"] = pd.to_timedelta(df["Time"])
If the format is different, you will need to convert the string format to the hh:mm:ss format.
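Since the question's files store the hour as an integer 0-23 rather than an hh:mm:ss string, pd.to_timedelta can also convert integers directly with unit="h" (the column name "Hour" here is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Hour": [0, 1, 23]})             # hour-of-day as integers, as in the question
df["Time"] = pd.to_timedelta(df["Hour"], unit="h")  # 0 -> 00:00:00, 23 -> 23:00:00
print(df["Time"])
```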
Now, to combine these columns, we can simply add them:
df["DateTime"] = df["Date"] + df["Time"]
To create the formatted datetime column you mentioned, you can make a new column in string form. The below will give "06-04 21", indicating June 4 at 9 PM. The strftime directives can produce whatever format you desire.
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")
You will need to do this for each file, so I recommend using a for loop. Below is a full code snippet; it will obviously vary depending on your column types, file names, etc.
import os            # module to iterate over the files
import pandas as pd

base_path = "path/to/directory"  # the directory where all your files are stored

# It will be faster to read in all files at once THEN format the dates
df = pd.DataFrame()
for file in os.listdir(base_path):
    if not file.endswith(".csv"):
        continue  # skip non-csv entries such as the .ipynb_checkpoints folder
    df_temp = pd.read_csv(os.path.join(base_path, file))  # read each csv in the base_path directory
    df = pd.concat((df, df_temp), axis=0)  # concatenating (merging) the files

# Formatting the data
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")         # date conversion
df["Time"] = pd.to_timedelta(df["Time"])                           # time conversion
df["DateTime"] = df["Date"] + df["Time"]                           # combine date and time into a single column
df["Formatted_DateTime"] = df["DateTime"].dt.strftime("%m-%d %H")  # format the datetime values
Now that everything is formatted, the averaging portion is easy. Since you are only interested in averaging the values for each month-day hour, we can use the groupby capability.
df_group = df.groupby(["Formatted_DateTime"])  # group your data by unique values of the "Formatted_DateTime" column
df_average = df_group.mean()  # average the data within each group (this also accounts for the leap days)
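This is also why no leap-day padding is needed: mean() divides each group by its own row count, so the "02-29" group is simply averaged over fewer years. A small sketch with made-up numbers:

```python
import pandas as pd

# Made-up values: "02-28 00" occurs in two years, "02-29 00" only in a leap year
df = pd.DataFrame({
    "Formatted_DateTime": ["02-28 00", "02-28 00", "02-29 00"],
    "Ozone": [0.030, 0.040, 0.050],
})
df_average = df.groupby(["Formatted_DateTime"]).mean()
# "02-28 00" averages over 2 values (0.035); "02-29 00" over its 1 value (0.050)
print(df_average)
```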
It's always good to check your work!
print(df_average.head(5)) # This will print the first 5 rows of averaged values
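One last note on the indexing you asked about: the groupby result already uses "Formatted_DateTime" as its index, so no separate step is needed there. If you instead want that column as the index on the concatenated frame itself, set_index is the usual route (a sketch with made-up values):

```python
import pandas as pd

df = pd.DataFrame({"Formatted_DateTime": ["06-04 21", "06-04 22"], "Ozone": [0.03, 0.02]})
df = df.set_index("Formatted_DateTime")  # month-day hour becomes the row index
print(df.index.name)
```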