Search code examples
pythonpandasgroup-bysum

Sum number of entries for each year in Pandas dataframe


I have a dataset from the National Hurricane Center, it has entries for all hurricanes with a unique name for each hurricane. There are many entries for each hurricane. I am trying to collect the number of hurricanes for each year. Complete dataset is: https://www.kaggle.com/noaa/hurricane-database

Data looks like this (abbreviated):

ID,Name,Date
AL011851,UNNAMED,18510625
AL011851,UNNAMED,18510625
AL011851,UNNAMED,18510625
AL021851,UNNAMED,18510705
AL031852,UNNAMED,18520710
AL041852,UNNAMED,18520816
AL041852,UNNAMED,18520810

My code is:

df = pd.read_csv("atlantic.csv")

#Convert date to pandas datetime
df["Date"] = pd.to_datetime(df["Date"].astype(str), format="%Y%m%d")
print(df.head(15))

df2 = pd.to_datetime(df["Date"]).dt.year.value_counts()

df2 only gives me the total number of entries for the year. I need a total of unique IDs not total entries. I am stumped on how to accomplish this.


Solution

  • In order to get the unique count of ID per year please use this

    df2 = df.set_index("Date").resample('Y').nunique()['ID']
    

    or

    df2 = df.groupby(df.Date.dt.year).nunique()['ID']