Search code examples
rnamingdata-cleaninglarge-data

Multiple columns processing and dynamically naming new columns


Variables are mistakenly being entered into multiple columns eg: "aaa_1", "aaa_2" and "aaa_3", or "ccc_1, "ccc_2", and "ccc_3"). Need to create single new columns (eg "aaa", or "ccc"). Some variables are currently in a single column though ("hhh_1"), but more columns may be added (hhh_2 etc).

This is what I got:

aaa_1 <- c(43, 23, 65, NA, 45)  
aaa_2 <- c(NA, NA, NA, NA, NA)    
aaa_3 <- c(NA, NA, 92, NA, 82)  
ccc_1 <- c("fra", NA, "spa", NA, NA)   
ccc_2 <- c(NA, NA, NA, "wez", NA)  
ccc_3 <- c(NA, "ija", NA, "fda", NA)    
ccc_4 <- c(NA, NA, NA, NA, NA)
hhh_1 <- c(183, NA, 198, NA, 182)    
dataf1 <- data.frame(aaa_1,aaa_2,aaa_3,ccc_1,ccc_2, ccc_3,ccc_4,hhh_1)  

This is what I want:

aaa <- c(43, 23, NA, NA, NA)
ccc <- c("fra", "ija", "spa", NA, NA) 
hhh <- c(183, NA, 198, NA, 182)   
dataf2 <- data.frame(aaa,ccc,hhh)

General solution needed as there are ~100 variables (eg "aaa", "hhh", "ccc", "ttt", "eee", "hhh"etc).

Thanks!


Solution

  • This is a base solution, i.e. no packages.

    First define get_only which when given a list converts it to a data.frame and applies get_only to each row. When given a vector it returns the single non-NA in it or NA if there is not only one.

    Define root to be the column names without the suffixes.

    Convert the data frame to a list of columns, group them by root and apply get_only to each such group.

    Finally, convert the resulting list to a data frame.

    get_only <- function(x) UseMethod("get_only")
    get_only.list <- function(x) apply(data.frame(x), 1, get_only)
    get_only.default <- function(x) if (sum(!is.na(x)) == 1) na.omit(x) else NA
    
    root <- sub("_.*", "", names(dataf1))
    as.data.frame(lapply(split(as.list(dataf1), root), FUN = get_only))
    

    giving:

      age country hight
    1  43     fra   183
    2  23     ija    NA
    3  NA     spa   198
    4  NA    <NA>    NA
    5  NA    <NA>   182