
Count all occurrences of a given text in grouped data


Part of my data looks as follows:

> q[,c(1,3)]
           Year       Language
1             1            C++
2             1              C
3             1            C++
4             1              C
5             1            C++
6             1     JavaScript
7             1            C++
8             2            C++
9             2           inny
10            2            C++
11            2           Java
12            3           Java
13            3           Java
14            3     JavaScript
15            3           Java
16            3     JavaScript
17            3           .NET
18            3           inny
19            3              R
20            3         Python
21            3           .NET
22            3         Python
23            3           Java
24            3           Java
25            3           Java
26            3           Java
27            3           Java
28            3           Java
29            3             C#
30            3            C++
31            3     JavaScript
32            3            C++
33            3     JavaScript
34            3           Java
35            3           Java
36            3         Python
37            3             C#
38            4              R
39            4              C
40            4           Java
41            4         Python
42            4            C++
43            4           .NET
44            4             C#
45            5           inny
46            5     JavaScript
47            5             C#
48            5         Python
49            5              R
50            2              C

The full dataset, named q, also has other columns that are not relevant here. For each year, I want to find the language that occurs most often. Sometimes several languages tie for the highest count within a year, and in that case I want to list every one of them.
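To make the tie case concrete, here is a small base R sketch using the Language values for years 1 and 4, typed in by hand from the rows above. The vector names (year1, year4, and so on) are made up for this illustration and are not part of q.

```r
# Languages for years 1 and 4, copied from the rows shown above.
year1 <- c("C++", "C", "C++", "C", "C++", "JavaScript", "C++")
year4 <- c("R", "C", "Java", "Python", "C++", ".NET", "C#")

counts1 <- table(year1)
counts4 <- table(year4)

# Names of every language whose count equals the maximum for that year.
top1 <- names(counts1)[counts1 == max(counts1)]
top4 <- names(counts4)[counts4 == max(counts4)]

# which.max() reports only the first maximum, so it would hide the ties.
first4 <- names(which.max(counts4))

print(top1)
print(top4)
print(first4)
```

Year 1 has a single winner, while in year 4 every language appears exactly once, so all of them share the maximum. That is why a plain which.max() is not enough here.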

Expected output:

    Year Language     
 1     1 C++       
 2     2 C++       
 3     3 Java      
 4     4 .NET      
 5     4 C         
 6     4 C#        
 7     4 C++       
 8     4 Java      
 9     4 Python    
10     4 R         
11     5 C#        
12     5 inny      
13     5 JavaScript
14     5 Python    
15     5 R   

Solution

  • Using dplyr:

    library(dplyr)

    # Per year, keep each language whose count equals the maximum.
    q %>%
      group_by(Year) %>%
      summarise(language = names(which(table(Language) == max(table(Language)))))


    Output:

        Year language  
       <int> <chr>     
     1     1 C++       
     2     2 C++       
     3     3 Java      
     4     4 .NET      
     5     4 C         
     6     4 C#        
     7     4 C++       
     8     4 Java      
     9     4 Python    
    10     4 R         
    11     5 C#        
    12     5 inny      
    13     5 JavaScript
    14     5 Python    
    15     5 R
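
The same result can also be reached by counting first and then filtering, which some readers may find easier to follow. The following is a minimal sketch, not the answer's original method. q_demo is a hypothetical data frame holding only the rows for years 1, 2 and 5 from the question, and top_languages is likewise a name chosen for this example.

```r
library(dplyr)

# Hypothetical stand-in for q: the question's rows for years 1, 2 and 5 only.
q_demo <- data.frame(
  Year = c(1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2,
           5, 5, 5, 5, 5),
  Language = c("C++", "C", "C++", "C", "C++", "JavaScript", "C++",
               "C++", "inny", "C++", "Java", "C",
               "inny", "JavaScript", "C#", "Python", "R"),
  stringsAsFactors = FALSE
)

top_languages <- q_demo %>%
  count(Year, Language) %>%   # one row per Year/Language pair, count in column n
  group_by(Year) %>%
  filter(n == max(n)) %>%     # keep every language tied for the yearly maximum
  ungroup() %>%
  select(Year, Language)

print(top_languages)
```

Because filter() keeps every row that matches the maximum, ties are preserved without any special handling. As a side note, in dplyr 1.1.0 and later summarise() warns when a group returns more than one row, and reframe() is the documented replacement for that use. On those versions, swapping summarise() for reframe() in the one-liner above should give the same rows.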