python, data-analysis

Categorize birth-death data in epochs


I have data on the birth and death years of several people. I want to efficiently compute how many people were alive in each of a set of predefined epochs.

For example, if I have this list of data:

  • Paul 1920-1950
  • Sara 1930-1950
  • Mark 1960-2020
  • Lennard 1960-1970

and I define the epochs 1900-1980 and 1980-2023, I want to count the people who were alive at any point during each epoch (not necessarily for the whole epoch). In this case, the result would be 4 people (Paul, Sara, Mark and Lennard) for the first epoch and 1 person (Mark) for the second epoch.

Is there an efficient routine for this? The only approach I can think of right now is a huge loop with a lot of ifs to do the categorizing.

I really appreciate any help you can provide.


Solution

  • You need to check that each person was born no later than the end of the period AND died no earlier than the start of the period.

    One way could be:

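    import pandas as pd

    # sample data from the question (column name 'birt/death' as used in this answer)
    df = pd.DataFrame({
        'name': ['Paul', 'Sara', 'Mark', 'Lennard'],
        'birt/death': ['1920-1950', '1930-1950', '1960-2020', '1960-1970'],
    })

    # add columns for birth year and death year
    df[['birth', 'death']] = df['birt/death'].str.split('-', expand=True)

    # convert to numeric (https://stackoverflow.com/a/43266945/15032126)
    cols = ['birth', 'death']
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')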
    
    The DataFrame then looks like this:

    index  name     birt/death  birth  death
    0      Paul     1920-1950   1920   1950
    1      Sara     1930-1950   1930   1950
    2      Mark     1960-2020   1960   2020
    3      Lennard  1960-1970   1960   1970

    Then count the rows whose lifespan overlaps the epoch:

    def counts_per_epoch(df, start, end):
        return len(df[(df['birth'] <= end) & (df['death'] >= start)])
    
    print(counts_per_epoch(df, 1900, 1980))
    print(counts_per_epoch(df, 1980, 2023))
    
    # Prints:
    # 4
    # 1
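
If there are many epochs, calling the function above once per epoch rescans every row each time. As a rough sketch of an alternative (not part of the original answer), the same overlap count can be obtained from two sorted lists and a binary search: the number of people alive in an epoch equals the number born on or before its end minus the number who died before its start. The function name `counts_for_epochs` and the `epochs` list are hypothetical, and the sketch assumes every person has birth <= death and every epoch has start <= end.

```python
from bisect import bisect_left, bisect_right


def counts_for_epochs(births, deaths, epochs):
    """Return {(start, end): number of people alive at some point in the epoch}.

    Assumes birth <= death for every person and start <= end for every epoch.
    """
    sorted_births = sorted(births)
    sorted_deaths = sorted(deaths)
    counts = {}
    for start, end in epochs:
        # People born on or before the end of the epoch.
        born_by_end = bisect_right(sorted_births, end)
        # People who died strictly before the epoch started.
        died_before_start = bisect_left(sorted_deaths, start)
        counts[(start, end)] = born_by_end - died_before_start
    return counts


# Sample data from the question.
births = [1920, 1930, 1960, 1960]
deaths = [1950, 1950, 2020, 1970]
epochs = [(1900, 1980), (1980, 2023)]

print(counts_for_epochs(births, deaths, epochs))
# {(1900, 1980): 4, (1980, 2023): 1}
```

With the DataFrame from the answer, the numeric `df['birth']` and `df['death']` columns could be passed in directly as `births` and `deaths`. Sorting happens once, so each additional epoch costs only two binary searches. The boundary handling matches the `<=` / `>=` comparison used above, so a person who died in 1980 would still be counted in an epoch starting in 1980.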