For my panel data analysis in Python I would like to inspect transition probabilities. I have person-year combinations and some categorical variable, for example health (1=excellent
, 2=good
, etc.).
I need a summary table of absolute and/or relative frequencies on how often changes happen from one state/category to another - per person, not per column. Especially the health state difference between index 6
and 7
should not be included, since it is not a transition from within one person.
Here is some sample data:
import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009',
'2003', '2004', '2005', '2006', '2007', '2008', '2009'],
'id': ['1', '1', '1', '1', '1', '1', '1',
'2', '2', '2', '2', '2', '2', '2',],
'health': ['3', '1', '2', '2', '5', '1', '1',
'1', '2', '3', '2', '1', '1', '2']}).astype(int)
Output should be as follows (counting the occurrences of state transition):
(Perhaps there is something in Python similar to Stata's xttrans
command?)
Create the new column with shift
. where
ensures we exclude it when the id
changes. Then this is crosstab
(or groupby size, or pivot_table) to get the counts.
import pandas as pd
#df = df.sort_values(['id', 'year'])
df['health_trans'] = df.health.shift(-1).where(df.id.eq(df.id.shift(-1)))
pd.crosstab(df.health, df.health_trans)
#health_trans 1.0 2.0 3.0 5.0
#health
#1 2 3 0 0
#2 1 1 1 1
#3 1 1 0 0
#5 1 0 0 0
To ensure all transitions are always listed use reindex
.
health = range(1,6)
(pd.crosstab(df.health, df.health_trans)
.reindex(health).reindex(health, axis=1)
.fillna(0).astype(int))
#health_trans 1 2 3 4 5
#health
#1 2 3 0 0 0
#2 1 1 1 0 1
#3 1 1 0 0 0
#4 0 0 0 0 0
#5 1 0 0 0 0
This may not deal with cases where an id
is missing some years as you would like. It seems like you have a balanced panel to begin with, in which case there is no issue.