I have a dataframe with records spanning multiple years:
WarName       | StartDate  | EndDate
'fakewar1'    | 01-01-1990 | 02-02-1995
'examplewar'  | 05-01-1990 | 03-07-1998
'examplewar2' | 05-07-1999 | 06-09-2002
I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:
Year | Number_of_wars
1989 | 0
1990 | 2
1991 | 2
1992 | 3
1994 | 2
Usually I would use something like df.groupby('year').count() to get the total wars by year, but since I am working with date ranges rather than single dates, that approach doesn't work.
I am currently writing a function that generates a list of years, and then for each year in the list checks each row in the dataframe and runs a function that checks if the year is within the date-range of that row (returning True if that is the case).
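The year_in_range helper is not shown in the post. A minimal sketch of what such a check could look like, assuming StartDate and EndDate hold pandas Timestamps and that a war counts for every calendar year it touches (both are assumptions, and the example row is hypothetical):

```python
import pandas as pd

def year_in_range(year, row):
    """Return True if `year` lies between the row's start and end years (inclusive)."""
    return row["StartDate"].year <= year <= row["EndDate"].year

# Hypothetical row, for illustration only
row = pd.Series({
    "WarName": "fakewar1",
    "StartDate": pd.Timestamp("1990-01-01"),
    "EndDate": pd.Timestamp("1995-02-02"),
})

print(year_in_range(1992, row))  # inside the span
print(year_in_range(1996, row))  # after the span
```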
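years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        # year_in_range is the helper described above; avoid naming the result `range`,
        # which would shadow the built-in
        in_range = year_in_range(year, row)
        if in_range:
            year_dict[year] = year_dict.get(year, 0) + 1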
This works, but it also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas way' to solve this?
Use a comprehension with pd.value_counts:
pd.value_counts([
    d.year for s, e in zip(df.StartDate, df.EndDate)
    for d in pd.date_range(s, e, freq='Y')
]).sort_index()
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
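Note that freq='Y' produces year-end timestamps, so a war's final year only appears if the war lasts until December 31. That is why 1995, 1998 and 2002 are not credited to the wars ending in those years. If every calendar year a war touches should count, one possible variant (a sketch, not part of the original answer) builds the year lists with plain range and lets Series.explode and Series.value_counts do the tallying. It also sidesteps pd.value_counts and the 'Y' alias, which newer pandas releases may warn about. The frame below repeats the sample data with ISO-formatted dates.

```python
import pandas as pd

df = pd.DataFrame({
    "WarName": ["fakewar1", "examplewar", "feuxwar2"],
    "StartDate": pd.to_datetime(["1990-01-01", "1990-05-01", "1999-05-07"]),
    "EndDate": pd.to_datetime(["1995-02-02", "1998-03-07", "2002-06-09"]),
})

# One list of calendar years per war, end year included
years = pd.Series(
    [list(range(s.year, e.year + 1)) for s, e in zip(df["StartDate"], df["EndDate"])],
    index=df.index,
)

# explode() yields one row per (war, year); value_counts() tallies wars per year
wars_per_year = years.explode().astype(int).value_counts().sort_index()
print(wars_per_year)
```

With this variant 1995 is counted for both of the first two wars, and 1998 and 2002 each get one.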
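An alternative that builds one combined index with functools.reduce:
from functools import reduce
def r(t):
    # Year-end timestamps between this row's start and end dates
    return pd.date_range(t.StartDate, t.EndDate, freq='Y')
pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()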
Setup used for the examples above:
import pandas as pd
df = pd.DataFrame(dict(
    WarName=['fakewar1', 'examplewar', 'feuxwar2'],
    StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
    EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])
      WarName  StartDate    EndDate
0    fakewar1 1990-01-01 1995-02-02
1  examplewar 1990-05-01 1998-03-07
2    feuxwar2 1999-05-07 2002-06-09
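The desired output in the question also lists years with zero wars (1989, for instance), whereas value_counts only reports years that occur at least once. One way to fill those gaps, sketched here with the counts printed in the answer and an assumed span of 1989 to 2002, is to reindex over the full range of years with fill_value=0:

```python
import pandas as pd

# Counts as printed in the answer; years without a war are simply absent
counts = pd.Series(
    [2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],
    index=[1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1999, 2000, 2001],
)

# Assumed span of interest, chosen only for illustration
all_years = list(range(1989, 2003))

# Missing years are added with a count of 0
full = (
    counts.reindex(all_years, fill_value=0)
    .rename_axis("Year")
    .rename("Number_of_wars")
)
print(full)
```

The same reindex call works on the result of either approach above, for example with range(1816, 2006) as in the question.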