I have data regarding the years of birth and death of several people. I want to compute efficiently how many people are in each of a group of pre-defined epochs.
For example. If I have this list of data:
and I define the epochs 1900-1980 and 1980-2023, I would want to compute the number of people alive in each period (not necessarily the whole range of the years). In this case, the result would be 4 people (Paul, Sara, Mark and Lennard) for the first epoch and 1 person (Mark) for the second epoch.
Is there any efficient routine out there? I would like to know, as the only way I can think of now is to create a huge loop with a lot of ifs to start categorizing.
I really appreciate any help you can provide.
So, you need to check that they have been born before the end of the period AND they have died after the start of the period.
One way could be:
# add columns for birth year and death year
df[['birth', 'death']] = df['birt/death'].str.split('-', expand=True)
# convert to numeric (https://stackoverflow.com/a/43266945/15032126)
cols = ['birth', 'death']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce', axis=1)
index | name | birt/death | birth | death |
---|---|---|---|---|
0 | Paul | 1920-1950 | 1920 | 1950 |
1 | Sara | 1920-1950 | 1920 | 1950 |
2 | Mark | 1960-2020 | 1960 | 2020 |
3 | Lennard | 1960-1970 | 1960 | 1970 |
def counts_per_epoch(df, start, end):
return len(df[(df['birth'] <= end) & (df['death'] >= start)])
print(counts_per_epoch(df, 1900, 1980))
print(counts_per_epoch(df, 1980, 2023))
# Prints
4
1