Tags: r, conditional-statements, frequency

Compare two groups based on a categorical variable in R


I have created a data frame, df, which contains more than 8,000 firm-years:

gvkey = company id

fam = dummy (equals 1 if the firm is a family firm)

industry = categorical variable

   gvkey   fam  industry
1   1004    0     6
2   1004    0     6
3   1004    0     6
4   1004    0     6
5   1004    0     6
6   1013    0     4
7   1013    0     4
8   1013    0     4
9   1013    0     4
10  1013    0     4
11  1013    0     4
12  1045    0     5
13  1045    0     5
14  1045    0     5
15  1045    0     5
16  1045    0     5
17  1045    0     5
18  1072    0     4
19  1072    0     4
20  1072    0     4
21  1072    0     4
22  1072    0     4
23  1076    1     9
24  1076    1     9
25  1076    1     9
26  1076    1     9
27  1076    1     9
28  1076    1     9
29  1078    0     4
30  1078    0     4
31  1078    0     4
32  1078    0     4
33  1078    0     4
34  1078    0     4
35  1121    1     6
36  1121    1     6
37  1121    1     6
38  1121    1     6
39  1121    1     6
40  1121    1     6
41  1161    0     4
42  1161    0     4
43  1161    0     4
44  1161    0     4
45  1161    0     4
46  1161    0     4
47  1209    0     4
48  1209    0     4
49  1209    0     4
50  1209    0     4
...

This is roughly how the final output that I want to create for my paper should look. The column "Industry description" corresponds to my column industry.

Verbal logic:

1) For all unique gvkey, create a column that counts the number of firms with fam = 0 in each industry.

2) For all unique gvkey, create a column that counts the number of firms with fam = 1 in each industry.

3) Create an output that shows the frequencies of family firms and non-family firms for each industry.

Maybe it is even possible to do this in a single piece of code?

Thank you so much!!


Solution

  • One dplyr option could be:

    library(dplyr)

    df %>%
     group_by(industry) %>%
     # n_distinct() counts each firm (gvkey) once, however many firm-years it has
     summarise(n_family = n_distinct(gvkey[fam == 1]),
               n_no_family = n_distinct(gvkey[fam == 0]),
               perc_family = n_family/n_distinct(gvkey)*100)
    
      industry n_family n_no_family perc_family
         <int>    <int>       <int>       <dbl>
    1        4        0           5           0
    2        5        0           1           0
    3        6        1           1          50
    4        9        1           0         100
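
  • If you would rather avoid dplyr, the same counts can probably be obtained in base R. The following is only a sketch: it rebuilds the 50 sample rows from the question so that it runs on its own, and it assumes that fam and industry never change within a gvkey (true for the sample, not checked for the full data). The names firms, counts and result are made up for this example.

```r
# Rebuild the 50 sample firm-years shown in the question
n_rows <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004, 1013, 1045, 1072, 1076, 1078, 1121, 1161, 1209), times = n_rows),
  fam      = rep(c(0, 0, 0, 0, 1, 0, 1, 0, 0), times = n_rows),
  industry = rep(c(6, 4, 5, 4, 9, 4, 6, 4, 4), times = n_rows)
)

# Keep one row per firm so that each gvkey is counted once
# (assumption: fam and industry are constant within a gvkey)
firms <- unique(df[, c("gvkey", "fam", "industry")])

# Cross-tabulate industry by family status; fixing the factor levels
# guarantees that both columns "0" and "1" exist
counts <- table(industry = firms$industry,
                fam = factor(firms$fam, levels = c(0, 1)))

result <- data.frame(
  industry    = as.integer(rownames(counts)),
  n_family    = as.integer(counts[, "1"]),
  n_no_family = as.integer(counts[, "0"])
)
result$perc_family <- 100 * result$n_family /
  (result$n_family + result$n_no_family)

print(result)
```

    On the sample rows this gives the same four industry rows as the dplyr output above. If a firm can switch family status or industry over time, unique() keeps one row per combination, so the percentages may differ from the dplyr version.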