Search code examples
pythonpandaspanel-data

How to calculate transition probabilities


For my panel data analysis in Python I would like to inspect transition probabilities. I have person-year combinations and some categorical variable, for example health (1=excellent, 2=good, etc.).

I need a summary table of absolute and/or relative frequencies on how often changes happen from one state/category to another - per person, not per column. Especially the health state difference between index 6 and 7 should not be included, since it is not a transition from within one person.

Here is some sample data:

import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009',
                             '2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1', 
                          '2', '2', '2', '2', '2', '2', '2',],
                   'health': ['3', '1', '2', '2', '5', '1', '1', 
                             '1', '2', '3', '2', '1', '1', '2']}).astype(int)

Output should be as follows (counting the occurrences of state transition):

enter image description here

(Perhaps there is something in Python similar to Stata's xttrans command?)


Solution

  • Create the new column with shift. where ensures we exclude it when the id changes. Then this is crosstab (or groupby size, or pivot_table) to get the counts.

    import pandas as pd
    #df = df.sort_values(['id', 'year'])
    
    df['health_trans'] = df.health.shift(-1).where(df.id.eq(df.id.shift(-1)))
    pd.crosstab(df.health, df.health_trans)
    
    #health_trans  1.0  2.0  3.0  5.0
    #health                          
    #1               2    3    0    0
    #2               1    1    1    1
    #3               1    1    0    0
    #5               1    0    0    0
    

    To ensure all transitions are always listed use reindex.

    health = range(1,6)
    
    (pd.crosstab(df.health, df.health_trans)
       .reindex(health).reindex(health, axis=1)
       .fillna(0).astype(int))
    
    #health_trans  1  2  3  4  5
    #health                     
    #1             2  3  0  0  0
    #2             1  1  1  0  1
    #3             1  1  0  0  0
    #4             0  0  0  0  0
    #5             1  0  0  0  0
    

    This may not deal with cases where an id is missing some years as you would like. It seems like you have a balanced panel to begin with, in which case there is no issue.