I am struggling with a collapse of my data.
Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.
I have a rank indicator which specifies the sequence by which sequence the observations have to be chosen.
Basically the observation with the first rank (thus 1 instead of 2) has to be chosen, as long as for that rank the value is not NA.
An additional question: The years in my dataset vary over time, thus is there a way to make the code dynamic in the sense that it applies the code to all column names from 1990 to 2025 when they exist?
df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1),
id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")),
`1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22),
`2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2),
`2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))
The result should be the following dataset:
result <- data.frame(country.code = c(1, 1, 1),
id = as.factor(c("GDP", "CA", "GR")),
`1999`= c(1000, 100, 22),
`2000`= c(1, 1, 12),
`2001`= c(3, 100, 1))
I attempted the following solution (but this does not work given the NA's in the data and I would have to specify each column:
test <- df %>% group_by(Country.Code, Indicator.Code) %>%
summarise(test1999 = `1999`[which.min(rank))
I don't see how I can explain R to omit the cases of the column 1999 that are NA.
We can subset using the minimum rank of the non-null values for a column e.g x[rank==min(rank[!is.na(x)])]
.
An additional question: The years in my dataset vary over time,....
Using summarise_at
, vars
and matches
can be used to select any column name with 4 digits i.e. 1990-2025 using a regular expression [0-9]{4}
(which means search for a digit "0-9" repeated exactly 4 times) and apply the above procedure to them using funs
librar(dplyr)
df %>% group_by(country.code,id) %>%
summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
df %>% group_by(country.code,id) %>%
summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
# A tibble: 3 x 5
# Groups: country.code [?]
country.code id `1999` `2000` `2001`
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 CA 100 1 100
2 1 GDP 1000 1 3
3 1 GR 22 12 1