I have a matrix in a long format (panel data) with several string variables, a categorical variable and a variable with numerical values.
The data contains information on output for several industrial sectors for countries at a given year. My idea would be to add two of those industry's for the same years within the same country, and change the name for the newly created industry.
For example, say I have the following matrix:
set.seed(10)
matrix <- cbind.data.frame(country = rep(c("aaa" , "bbb") , each = 6) , industry = rep(c("toys" , "paper") , each = 3 , times = 2) ,
year = rep(c(2000:2002) , times = 4) , production = sample(0:100 , 12) )
that gives:
country industry year production
[1,] "aaa" "toys" "2000" "8"
[2,] "aaa" "toys" "2001" "73"
[3,] "aaa" "toys" "2002" "75"
[4,] "aaa" "paper" "2000" "54"
[5,] "aaa" "paper" "2001" "71"
[6,] "aaa" "paper" "2002" "53"
[7,] "bbb" "toys" "2000" "38"
[8,] "bbb" "toys" "2001" "82"
[9,] "bbb" "toys" "2002" "87"
[10,] "bbb" "paper" "2000" "14"
[11,] "bbb" "paper" "2001" "91"
[12,] "bbb" "paper" "2002" "41"
I would like to add the production from "toys" with the one from "paper" for each year and country and call the new industry "toys and paper" like so:
year country variable value
1 2000 aaa toys_and_paper 62
2 2000 bbb toys_and_paper 52
3 2001 aaa toys_and_paper 144
4 2001 bbb toys_and_paper 173
5 2002 aaa toys_and_paper 128
6 2002 bbb toys_and_paper 128
I know how to do this using reshape2 and tidyverse like so:
library(reshape2)
library(tidyverse)
test <- dcast(matrix , year + country ~ industry)
test <- test %>%
mutate(toys_and_paper = paper + toys) %>%
select(year , country , toys_and_paper)
test <- melt(test , id.vars = c("year" , "country"))
Is there a more direct way to do this?
I think the original question has an example that is confusing since in the actual data set the there can be more industries.
Here's a toy data set with more than two industries
set.seed(10)
matrix <- cbind.data.frame(
country = rep(c("aaa", "bbb"), each = 9),
industry = rep(c("toys", "paper", "other"), each = 3, times = 2),
year = rep(c(2000:2002), times = 6),
production = sample(0:100, 18)
)
And a dplyr
solution to the problem
matrix %>%
dplyr::mutate(
industry = dplyr::if_else(
industry %in% c("toys", "paper"), "toys_and_paper", industry
)
) %>%
dplyr::group_by(
year,
country,
industry
) %>%
dplyr::summarise(
production = sum(production),
.groups = "drop"
)
If in the actual data set the industry
column is a factor (and it should), then you can replace the if_else()
statement for forcats::fct_unify()