Search code examples
rcategorizationpanel-data

Categorize groups in panel according to value of a variable in a specific year in R


I have a panel of different countries in R and I want to create categories based on the value of a specific variable (in this case 'var3') in a specific year (here 3).

An example of what I currently have:

# create data
test.data = as.data.frame(matrix(rexp(200, rate=.1), ncol=5))
colnames(test.data) = c("year", "country", "var1", "var2", "var3")
test.data$year = rep.int(1:5, 8)
test.data$country = rep(1:8, each=5)

# calculate median, minimum and maximum of 'var3'
median = quantile(x = test.data[test.data$year == 3, 5], probs = c(0.5))
min = min(test.data[test.data$year == 3, 5])
max = max(test.data[test.data$year == 3, 5])

# create category variable based on values of 'var3'
test.data$cat.1 = cut(test.data$var3, c(min, median, max))

In this case the value of 'cat.1' depends on the value of 'var3' of the corresponding observation, but I want it to depend on the value of a specific year of the specific country (i.e. I want the same value for all years of a specific country). Is there a straightforward way to do that or do I have to do it manually (select the countries of each group and assign the values to them). If the number of groups is constant doing it manually is ok, but it's a bit cumbersome if you want to try different group sizes.

Currently the result looks like as follows:

year country       var1        var2       var3       cat.1
1     1       1  4.4206363  9.32628504  4.0988089  (1.2,6.71]
2     2       1  7.6072491  6.30949828 39.5694414        <NA>
3     3       1  3.3774183  7.94397550  8.8419793 (6.71,22.2]
4     4       1  1.0300372  9.93858310  0.4908481        <NA>
5     5       1  6.4514008  2.10367840 29.6052797        <NA>
6     1       2  8.7609877  5.76332181 17.4117561 (6.71,22.2]
7     2       2  6.1253021  0.17258071 23.9096280        <NA>
8     3       2 48.3335241  1.19255084  3.3644827  (1.2,6.71]
9     4       2 34.1683821 10.98216846 29.0255100        <NA>
10    5       2 15.5824154  2.53484781 16.3466249 (6.71,22.2]

But I want this instead:

year country       var1        var2       var3       cat.1
1     1       1  4.4206363  9.32628504  4.0988089 (6.71,22.2]
2     2       1  7.6072491  6.30949828 39.5694414 (6.71,22.2]
3     3       1  3.3774183  7.94397550  8.8419793 (6.71,22.2]
4     4       1  1.0300372  9.93858310  0.4908481 (6.71,22.2]
5     5       1  6.4514008  2.10367840 29.6052797 (6.71,22.2]
6     1       2  8.7609877  5.76332181 17.4117561  (1.2,6.71]
7     2       2  6.1253021  0.17258071 23.9096280  (1.2,6.71]
8     3       2 48.3335241  1.19255084  3.3644827  (1.2,6.71]
9     4       2 34.1683821 10.98216846 29.0255100  (1.2,6.71]
10    5       2 15.5824154  2.53484781 16.3466249  (1.2,6.71]

Solution

  • Something along the following lines, maybe? This first creates a variable that, for each country, corresponds to var3 in year 3, and then cuts this variable. This should work for many groups, if by group you mean countries.

    library(dplyr)
    out <- test.data %>% group_by(country) %>% mutate(to.cut = var3[year==3] )
    out$cat.1 = cut(out$to.cut, c(min, median, max), include.lowest=T)
    out
    
    Source: local data frame [40 x 7]
    Groups: country [8]
    
        year country      var1      var2      var3       cat.1   to.cut
       (int)   (int)     (dbl)     (dbl)     (dbl)      (fctr)    (dbl)
    1      1       1  2.945957  8.785060 21.820063 (10.3,35.5] 12.06913
    2      2       1  1.473719 29.944750  6.915839 (10.3,35.5] 12.06913
    3      3       1  8.880734  3.624519 12.069131 (10.3,35.5] 12.06913
    4      4       1 31.746000  9.698126  5.929075 (10.3,35.5] 12.06913
    5      5       1 34.639945  2.983025 15.438284 (10.3,35.5] 12.06913
    6      1       2 16.757240  8.719741 27.412963 (10.3,35.5] 14.74931
    7      2       2  1.155467  3.146425  1.730943 (10.3,35.5] 14.74931
    8      3       2  1.738710  2.292280 14.749311 (10.3,35.5] 14.74931
    9      4       2 13.120079  0.130744  3.000918 (10.3,35.5] 14.74931
    10     5       2 27.898422 10.891313 20.912835 (10.3,35.5] 14.74931
    

    Comment: The numbers are obviously different from your tables since we have different seeds for the random number generators. In your tables, the outcome of cut varies from country 1 to country 2. Since the cutting is done over all countries, this difference can very well be due to randomness. If this is not what you are expecting, please provide a seed with which your original table can be reproduced.