Search code examples
pandasmulti-indexpanel-data

Determine number of entrants and exits in each year in panel data


I have three data frames with the same variables for the same firms but corresponding to different years (2016, 2017, 2018). The ID variable corresponds to a firm ID.

df2016 = pd.DataFrame({"ID": [99,101,102,103,104], "A": [1,2,3,4,5], "B": [2,4,6,8,10], "year": [2016,2016,2016,2016,2016]})
    ID  A   B   year
0   99  1   2   2016
1   101 2   4   2016
2   102 3   6   2016
3   103 4   8   2016
4   104 5   10  2016

df2017 = pd.DataFrame({"ID": [100,101,102,104], "A": [5,6,7,8], "B": [9,11,13,15], "year": [2017,2017,2017,2017]})

    ID  A   B   year
0   100 5   9   2017
1   101 6   11  2017
2   102 7   13  2017
3   104 8   15  2017

df2018 = pd.DataFrame({"ID": [100,106], "A": [6,8], "B": [13,15], "year": [2018,2018]})
    ID  A   B   year
0   100 6   13  2018
1   106 8   15  2018

Goal

I want to determine the number of entering firms and exiting firms in each year.

Specifically, I would like:

  1. A new variable called entry that equals 1 if the firm ID was not in the data last year and is the data this year; and zero otherwise.
  2. Another variable called exit that equals 1 if the firm ID is in the data this year but is not in the data next year; and zero otherwise.
  3. Edge cases of entering in the first year or exiting in the last year could be handled however is best.

Desired Output (something like this):

            A   B    enter  exit
ID  year                
99  2016    1.0 2.0     1   1
101 2016    2.0 4.0     1   0
102 2016    3.0 6.0     1   0
103 2016    4.0 8.0     1   1
104 2016    5.0 10.0    1   0
100 2017    5.0 9.0     1   0
101 2017    6.0 11.0    0   1
102 2017    7.0 13.0    0   1
104 2017    8.0 15.0    0   1
100 2018    6.0 13.0    0   0
106 2018    8.0 15.0    1   0

I start by concatenating the three years of data and setting the multi-index to (ID, year):

df = pd.concat([df2016, df2017, df2018])

df.set_index(["ID", "year"], inplace=True)
           A    B
ID  year        
99  2016    1   2
101 2016    2   4
102 2016    3   6
103 2016    4   8
104 2016    5   10
100 2017    5   9
101 2017    6   11
102 2017    7   13
104 2017    8   15
100 2018    6   13
106 2018    8   15

Not sure where to go next.


Solution

  • You can use:

    df = pd.concat([df2016, df2017, df2018], ignore_index=True)
    g = df.groupby('ID')['year']
    df['entry'] = g.diff(1).ne(1).astype(int)
    df['exit'] = g.diff(-1).ne(-1).astype(int)
    print(df)
    
    # Output
         ID  A   B  year  entry  exit
    0    99  1   2  2016      1     1
    1   101  2   4  2016      1     0
    2   102  3   6  2016      1     0
    3   103  4   8  2016      1     1
    4   104  5  10  2016      1     0
    5   100  5   9  2017      1     0
    6   101  6  11  2017      0     1
    7   102  7  13  2017      0     1
    8   104  8  15  2017      0     1
    9   100  6  13  2018      0     1
    10  106  8  15  2018      1     1