I have created a data frame df, which contains more than 8,000 firm-years. It has three variables:
gvkey = company id
fam = dummy (equals 1 if the firm is a family firm)
industry = categorical variable
gvkey fam industry
1 1004 0 6
2 1004 0 6
3 1004 0 6
4 1004 0 6
5 1004 0 6
6 1013 0 4
7 1013 0 4
8 1013 0 4
9 1013 0 4
10 1013 0 4
11 1013 0 4
12 1045 0 5
13 1045 0 5
14 1045 0 5
15 1045 0 5
16 1045 0 5
17 1045 0 5
18 1072 0 4
19 1072 0 4
20 1072 0 4
21 1072 0 4
22 1072 0 4
23 1076 1 9
24 1076 1 9
25 1076 1 9
26 1076 1 9
27 1076 1 9
28 1076 1 9
29 1078 0 4
30 1078 0 4
31 1078 0 4
32 1078 0 4
33 1078 0 4
34 1078 0 4
35 1121 1 6
36 1121 1 6
37 1121 1 6
38 1121 1 6
39 1121 1 6
40 1121 1 6
41 1161 0 4
42 1161 0 4
43 1161 0 4
44 1161 0 4
45 1161 0 4
46 1161 0 4
47 1209 0 4
48 1209 0 4
49 1209 0 4
50 1209 0 4
...
Roughly, the output should have one row per industry (Industry description = industry).
Verbal logic:
1) For all unique gvkey, create a column which counts the number of firms with fam = 0 in each industry.
2) For all unique gvkey, create a column which counts the number of firms with fam = 1 in each industry.
3) Create an output which shows the frequencies of family firms and non-family firms for each industry.
Maybe it is even possible to do all of this in a single piece of code?
Thank you so much!
One dplyr option could be:
    library(dplyr)

    df %>%
      group_by(industry) %>%
      # n_distinct() counts each firm (gvkey) once, however many firm-years it has
      summarise(n_family = n_distinct(gvkey[fam == 1]),
                n_no_family = n_distinct(gvkey[fam == 0]),
                perc_family = n_family / n_distinct(gvkey) * 100)
      industry n_family n_no_family perc_family
         <int>    <int>       <int>       <dbl>
    1        4        0           5           0
    2        5        0           1           0
    3        6        1           1          50
    4        9        1           0         100
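
As a cross-check, the same counts can also be sketched in base R by first collapsing the firm-years to one row per firm and then cross-tabulating. The snippet below rebuilds only the 50 sample rows shown in the question. The names firm_rows, firms and freq are made up for illustration, and the approach assumes that fam and industry do not change over time for a given gvkey.

```r
# Rebuild the 50 sample rows from the question (illustration only,
# not the full data set of more than 8,000 firm-years).
firm_rows <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)  # number of rows per firm in the sample
df <- data.frame(
  gvkey    = rep(c(1004, 1013, 1045, 1072, 1076, 1078, 1121, 1161, 1209),
                 times = firm_rows),
  fam      = rep(c(0, 0, 0, 0, 1, 0, 1, 0, 0), times = firm_rows),
  industry = rep(c(6, 4, 5, 4, 9, 4, 6, 4, 4), times = firm_rows)
)

# Collapse firm-years to one row per firm.
# Assumption: fam and industry are constant within each gvkey.
firms <- unique(df[, c("gvkey", "fam", "industry")])

# Cross-tabulate: rows are industries, columns are fam = 0 and fam = 1.
freq <- table(industry = firms$industry, fam = firms$fam)
freq
```

On the sample rows this gives five non-family firms in industry 4 and one family plus one non-family firm in industry 6, which agrees with the dplyr result. If a firm switched fam or industry between years, it would appear more than once in firms, so the assumption is worth checking on the full data.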