Search code examples
rtidyversereshape2

Adding rows of two categorical variables in a long dataset?


I have a matrix in a long format (panel data) with several string variables, a categorical variable and a variable with numerical values.

The data contains information on output for several industrial sectors for countries at a given year. My idea would be to add two of those industry's for the same years within the same country, and change the name for the newly created industry.

For example, say I have the following matrix:

set.seed(10)

matrix <- cbind.data.frame(country = rep(c("aaa" , "bbb") , each = 6) , industry = rep(c("toys" , "paper") ,  each = 3 , times = 2) , 
                year = rep(c(2000:2002) , times = 4) , production = sample(0:100 , 12) )

that gives:

      country industry year   production
 [1,] "aaa"   "toys"   "2000" "8"       
 [2,] "aaa"   "toys"   "2001" "73"      
 [3,] "aaa"   "toys"   "2002" "75"      
 [4,] "aaa"   "paper"  "2000" "54"      
 [5,] "aaa"   "paper"  "2001" "71"      
 [6,] "aaa"   "paper"  "2002" "53"      
 [7,] "bbb"   "toys"   "2000" "38"      
 [8,] "bbb"   "toys"   "2001" "82"      
 [9,] "bbb"   "toys"   "2002" "87"      
[10,] "bbb"   "paper"  "2000" "14"      
[11,] "bbb"   "paper"  "2001" "91"      
[12,] "bbb"   "paper"  "2002" "41"  

I would like to add the production from "toys" with the one from "paper" for each year and country and call the new industry "toys and paper" like so:

  year country       variable value
1 2000     aaa toys_and_paper    62
2 2000     bbb toys_and_paper    52
3 2001     aaa toys_and_paper   144
4 2001     bbb toys_and_paper   173
5 2002     aaa toys_and_paper   128
6 2002     bbb toys_and_paper   128

I know how to do this using reshape2 and tidyverse like so:

library(reshape2)
library(tidyverse)

test <- dcast(matrix , year + country ~ industry)

test <- test %>%
  mutate(toys_and_paper = paper + toys) %>%
  select(year , country , toys_and_paper)

test <- melt(test , id.vars = c("year" , "country"))

Is there a more direct way to do this?


Solution

  • I think the original question has an example that is confusing since in the actual data set the there can be more industries.

    Here's a toy data set with more than two industries

    set.seed(10)
    
    matrix <- cbind.data.frame(
       country = rep(c("aaa", "bbb"), each = 9),
       industry = rep(c("toys", "paper", "other"), each = 3, times = 2),
       year = rep(c(2000:2002), times = 6),
       production = sample(0:100, 18)
    )
    

    And a dplyr solution to the problem

    matrix %>% 
       dplyr::mutate(
          industry = dplyr::if_else(
             industry %in% c("toys", "paper"), "toys_and_paper", industry
          )
       ) %>% 
       dplyr::group_by(
          year,
          country,
          industry
       ) %>% 
       dplyr::summarise(
          production = sum(production),
          .groups = "drop"
       )
    

    If in the actual data set the industry column is a factor (and it should), then you can replace the if_else() statement for forcats::fct_unify()