Pandas: Get per-year counts for date ranges spanning multiple years


I have a dataframe with records spanning multiple years:

WarName        |  StartDate   |  EndDate
---------------------------------------------
'fakewar1'        01-01-1990     02-02-1995
'examplewar'      05-01-1990     03-07-1998
(...)
'examplewar2'     05-07-1999     06-09-2002

I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:

  Year  |  Number_of_wars
----------------------------
  1989         0
  1990         2
  1991         2
  1992         3
  1994         2

Usually I would use something like df.groupby('year').count() to get total wars by year, but that approach doesn't work here because each row covers a range of dates rather than a single date.

I am currently writing a function that generates a list of years, and then for each year in the list checks each row in the dataframe and runs a function that checks if the year is within the date-range of that row (returning True if that is the case).

years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        # year_in_range returns True if `year` falls within the row's date range
        if year_in_range(year, row):
            year_dict[year] = year_dict.get(year, 0) + 1

This works, but it also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas-way' to solve this issue?


Solution

  • Use a comprehension with pd.value_counts

    pd.value_counts([
        d.year for s, e in zip(df.StartDate, df.EndDate)
        for d in pd.date_range(s, e, freq='Y')
    ]).sort_index()
    
    1990    2
    1991    2
    1992    2
    1993    2
    1994    2
    1995    1
    1996    1
    1997    1
    1999    1
    2000    1
    2001    1
    dtype: int64
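
Note that the output above counts a war in a given year only if that year's December 31 falls inside the range, because freq='Y' generates year-end dates. fakewar1 ends on 1995-02-02, so it is not counted for 1995. If every calendar year a war touches should count, one option is to build the years with a plain Python range instead. This is a minimal sketch, assuming inclusive end years are what you want. The frame is rebuilt here so the snippet is self-contained, and the names `years` and `counts` are only illustrative. It also avoids pd.value_counts and the 'Y' alias, which I believe are deprecated in recent pandas releases.

```python
import pandas as pd

# Same sample data as in the Setup section, written with ISO dates.
df = pd.DataFrame({
    "WarName": ["fakewar1", "examplewar", "feuxwar2"],
    "StartDate": pd.to_datetime(["1990-01-01", "1990-05-01", "1999-05-07"]),
    "EndDate": pd.to_datetime(["1995-02-02", "1998-03-07", "2002-06-09"]),
})

# One entry per (war, calendar year) pair, start and end year inclusive.
years = [
    year
    for start, end in zip(df["StartDate"], df["EndDate"])
    for year in range(start.year, end.year + 1)
]

# Count how many wars touch each year.
counts = pd.Series(years).value_counts().sort_index()
print(counts)
```

With this data, 1995 comes out as 2 rather than 1, and 1998 and 2002 now appear in the result.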
    

    Alternate

    from functools import reduce
    
    def r(t):
        return pd.date_range(t.StartDate, t.EndDate, freq='Y')
    
    pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()
    

    Setup

    import pandas as pd
    
    df = pd.DataFrame(dict(
        WarName=['fakewar1', 'examplewar', 'feuxwar2'],
        StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
        EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
    ), columns=['WarName', 'StartDate', 'EndDate'])
    
    df
    
          WarName  StartDate    EndDate
    0    fakewar1 1990-01-01 1995-02-02
    1  examplewar 1990-05-01 1998-03-07
    2    feuxwar2 1999-05-07 2002-06-09
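
The desired output in the question also lists years with no wars (1989 with 0), which value_counts never produces. One way to get those rows is to compare every year in a fixed window against every row at once with NumPy broadcasting. This is essentially a vectorized form of the nested loop from the question. It is a sketch, assuming year-level granularity is enough and that a war counts for both its start and end year. The window 1989 to 2003 and the names `active` and `wars_per_year` are placeholders, not anything from the original post.

```python
import numpy as np
import pandas as pd

# Same sample data as in the Setup section, written with ISO dates.
df = pd.DataFrame({
    "WarName": ["fakewar1", "examplewar", "feuxwar2"],
    "StartDate": pd.to_datetime(["1990-01-01", "1990-05-01", "1999-05-07"]),
    "EndDate": pd.to_datetime(["1995-02-02", "1998-03-07", "2002-06-09"]),
})

# Hypothetical reporting window; adjust to the years you care about.
years = np.arange(1989, 2004)

start = df["StartDate"].dt.year.to_numpy()
end = df["EndDate"].dt.year.to_numpy()

# active[i, j] is True when war j overlaps calendar year years[i].
active = (start <= years[:, None]) & (years[:, None] <= end)

# Summing across wars gives one count per year, zeros included.
wars_per_year = pd.Series(
    active.sum(axis=1),
    index=pd.Index(years, name="Year"),
    name="Number_of_wars",
)
print(wars_per_year)
```

The boolean matrix has one cell per (year, war) pair, so this is fine for a few hundred years and a few thousand wars, but the comprehension approach scales better for very large inputs.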