Search code examples
rwhile-loopsimplification

Remove rows until columns are identical over multiple data frames


I have 4 data frames named w, x, y, z each with 3 columns and identical column names. I now execute an operation that removes rows until the column named Type is identical over all four data frames.

To achieve this I am using a while loop with the following code:


list_df <- list(z, w, x, y)
tmp <- lapply(list_df, `[[`, 'Type')
i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
                
while (i == 0) {
                  
 z <- z[(z$Type %in% x$Type),]
 y <- y[(y$Type %in% x$Type),]
 w <- w[(w$Type %in% x$Type),]
                      
 z <- z[(z$Type %in% w$Type),]
 y <- y[(y$Type %in% w$Type),]
 x <- x[(x$Type %in% w$Type),]
                     
 z <- z[(z$Type %in% y$Type),]
 x <- x[(x$Type %in% y$Type),]
 w <- w[(w$Type %in% y$Type),]
                      
 x <- x[(x$Type %in% z$Type),]
 w <- w[(w$Type %in% z$Type),]
 y <- y[(y$Type %in% z$Type),]
                     
 list_df <- list(z, w, x, y)
 tmp <- lapply(list_df, `[[`, 'Type')
 i <- as.integer(as.logical(all(sapply(tmp, function(x) all(x == tmp[[1]])))))
 }

In this code, a list is created for the column Type of every data frame. Then the value i tests for identicality and produces 0 if false and 1 if true. The while loop then performs the deletion of rows not included in every data frame and only stops until i becomes 1.

This code works, but applying it to bigger data can result in a long time for the code to go through. Does anybody have an idea on how to simplify this execution?

For reproducible example:

w <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K", 
"130747"), X1980 = c(NA, NA, NA, 271835, NA, NA, NA, NA, NA, 
NA, NA, NA), X1981 = c(NA, NA, NA, 290314, NA, NA, NA, NA, NA, 
NA, NA, NA)), row.names = c("2", "4", "7", "8", "10", "11", "13", 
"16", "17", "21", "22", "23"), class = "data.frame")

x <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V", "680681", "329909"), X1980 = c(NA, NA, NA, 1026815, 
NA, NA, NA), X1981 = c(NA, NA, NA, 826849, NA, NA, NA)), row.names = c("2", 
"4", "7", "8", "10", "11", "13"), class = "data.frame")

y <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V"), X1980 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_), X1981 = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_)), row.names = c("2", "4", "7", "8", "10"), class = "data.frame")

z <- structure(list(Type = c("26809D", "28503C", "360254", "69298N", 
"32708V", "680681", "329909", "696978", "32993F", "867609", "51206K", 
"130747", "50610H"), X1980 = c(NA, NA, NA, 0.264736101439889, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), X1981 = c(NA, NA, NA, 0.351108848169376, 
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("2", "4", 
"7", "8", "10", "11", "13", "16", "17", "21", "22", "23", "24"
), class = "data.frame")

Solution

  • We assume that the question is how to get the values of Type that are common to 4 data frames each of which has a Type column containing unique values.

    Form a list L of the data frames, extract the Type column using lapply and [ and iterate merge over that using Reduce :

    L <- list(w, x, y, z)
    L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE) # list of DFs w only Type col
    Reduce(merge, L.Type)$Type
    ## [1] "26809D" "28503C" "32708V" "360254" "69298N"
    

    or replace last line with this giving the same result except for order:

    Reduce(intersect, L.Type)$Type
    ## [1] "26809D" "28503C" "360254" "69298N" "32708V"
    

    Another approach which is a bit tedious but does reduce the calulation to one line is to manually iterate intersect:

    intersect(w$Type, intersect(x$Type, intersect(y$Type, z$Type)))
    ## [1] "26809D" "28503C" "360254" "69298N" "32708V"
    

    Another example

    The example data is not very good to illustrate this because every data frame has the same values of Type so let us create another example. BOD is a built-in data frame has 6 rows. We assign it to X and rename the columns so that the first one has the name Type. Then for i equals 1, 2, 3, 4 we remove the i-th row giving 4 data frames with 5 rows each and 2 values of Type common to all 4. The result correctly shows that 5 and 7 are the only common Type values.

    # set up input L, a list of 4 data frames
    X <- BOD
    names(X) <- c("Type", "X")
    L <- lapply(1:4, function(i) X[-i, ])
    
    L.Type <- lapply(L, "[", TRUE, "Type", drop = FALSE)
    Reduce(merge, L.Type)$Type
    ## [1] 5 7