Search code examples
rdplyrsummarize

dplyr collapse by rank of variable but ignore NA


I am struggling with a collapse of my data.

Basically my data consists of multiple indicators with multiple observations for each year. I want to convert this to one observation for each indicator for each country.

I have a rank indicator which specifies the sequence by which sequence the observations have to be chosen.

Basically the observation with the first rank (thus 1 instead of 2) has to be chosen, as long as for that rank the value is not NA.

An additional question: The years in my dataset vary over time, thus is there a way to make the code dynamic in the sense that it applies the code to all column names from 1990 to 2025 when they exist?

df <- data.frame(country.code = c(1,1,1,1,1,1,1,1,1,1,1,1), 
                        id = as.factor(c("GDP", "GDP", "GDP", "GDP", "CA", "CA", "CA", "GR", "GR", "GR", "GR", "GR")), 
                       `1999` = c(NA,NA,NA, 1000,NA,NA, 100,NA,NA, NA,NA,22), 
                       `2000` = c(NA,NA,1, 2,NA,1, 2,NA,1000, 12,13,2), 
                       `2001` = c(3,100,1, 3,100,20, 1,1,44, 65,NA,NA),
                       rank = c(1, 2 , 3 , 4 , 1, 2, 3, 1, 3, 2, 4, 5))

The result should be the following dataset:

    result <- data.frame(country.code = c(1, 1, 1), 
                         id = as.factor(c("GDP", "CA", "GR")),
                         `1999`= c(1000, 100, 22),
                         `2000`= c(1, 1, 12),
                         `2001`= c(3, 100, 1))

I attempted the following solution (but this does not work given the NA's in the data and I would have to specify each column:

    test <- df %>% group_by(Country.Code, Indicator.Code) %>% 
                summarise(test1999 = `1999`[which.min(rank))

I don't see how I can explain R to omit the cases of the column 1999 that are NA.


Solution

  • We can subset using the minimum rank of the non-null values for a column e.g x[rank==min(rank[!is.na(x)])].

    An additional question: The years in my dataset vary over time,....

    Using summarise_at, vars and matches can be used to select any column name with 4 digits i.e. 1990-2025 using a regular expression [0-9]{4} (which means search for a digit "0-9" repeated exactly 4 times) and apply the above procedure to them using funs

    librar(dplyr)    
    df %>% group_by(country.code,id) %>% 
           summarise(`1999` = `1999`[rank==ifelse(all(is.na(`1999`)),1, min(rank[!is.na(`1999`)]))])
    
    df %>% group_by(country.code,id) %>% 
           summarise_at(vars(matches("[0-9]{4}")),funs(.[rank==ifelse(all(is.na(.)), 1, min(rank[!is.na(.)]))]))
    
     # A tibble: 3 x 5
     # Groups:   country.code [?]
      country.code id    `1999` `2000` `2001`
             <dbl> <fct>  <dbl>  <dbl>  <dbl>
    1            1 CA       100      1    100
    2            1 GDP     1000      1      3
    3            1 GR        22     12      1