Search code examples
rdplyrtidyrrbindmutate

Subset large dataframe over distinct singular character variable column value


I'm still learning R and was wondering if I there was an elegant way of manipulating the below df to achieve df2.

I'm not sure if it's a loop that is supposed to be used for this, but basically I want to take a distinct on each V(X)_ID and it's associated V(X)_Z and return the first row V(X)_ID along with each subsequent occurrence of every other V(X)_Z (There are only two distinct types of V(X)_Z's).

This probably sounds confusing in words so hopefully an example from df to the desired df2 will assist in visualising what I'm trying to ask.

V1_ID <- c('A','B','I','N','G') 
V1_X <- c(1,2,3,4,5) 
V1_Y <- c(5.1,4.2,3.2,2.1,6.1) 
V1_Z <- c('Tom','Tom','Bill','Tom','Tom')

V2_ID <- c('B','D','E','F','G') 
V2_X <- c(2,5,6,7,5) 
V2_Y <- c(4.2,2,1,9,6.1) 
V2_Z <- c('Tom','Tom','Tom','Tom','Tom')

V3_ID <- c('C','B','A','N','G') 
V3_X <- c(0,2,1,4,5) 
V3_Y <- c(3,4.2,5.1,2.1,6.1) 
V3_Z <- c('Bill','Bill','Bill','Tom','Tom')

V4_ID <- c('N','G','C','B','A') 
V4_X <- c(4,5,0,2,1) 
V4_Y <- c(2,6.1,3,4.2,5.1) 
V4_Z <- c('Tom','Tom','Bill','Bill','Bill')

df <-data.frame(V1_ID,V1_X,V1_Y,V1_Z,V2_ID,V2_X,V2_Y,V2_Z,V3_ID,V3_X,V3_Y,V3_Z,V4_ID,V4_X,V4_Y,V4_Z)

V1_ID <- c('A','I',NA,NA) 
V1_X <- c(1,3,NA,NA) 
V1_Y <- c(5.1,3.2,NA,NA) 
V1_Z <- c('Tom','Bill',NA,NA)

V3_ID <- c('C','N','G',NA) 
V3_X <- c(0,4,5,NA) 
V3_Y <- c(3,2.1,6.1,NA) 
V3_Z <- c('Bill','Tom','Tom',NA)

V4_ID <- c('N','C','B','A') 
V4_X <- c(4,0,2,1) 
V4_Y <- c(2,3,4.2,5.1) 
V4_Z <- c('Tom','Bill','Bill','Bill')

df2 <- data.frame(V1_ID,V1_X,V1_Y,V1_Z,V3_ID,V3_X,V3_Y,V3_Z,V4_ID,V4_X,V4_Y,V4_Z)

You can see that V2 has been excluded from the desired dataframe because there are no occurrences of other distinct V2_Z aside from "Tom".

Your assistance is much appreciated as I have hundreds of these types of columns in this type of format and approaching it from a manual standpoint is very draining.

Thanks


Solution

  • I think it is a good idea to bring this data first in long format and then think about what you want to filter. Below is a first approach, maybe you can elaborate a bit more on the exact conditions you want to filter.

    library(tidyverse)
    
    df_long <- df %>%
      pivot_longer(cols = everything(),
                   names_to = c("no", ".value"),
                   names_pattern = "(.*)_([^_]+)$",
                   values_transform = as.character)
    
    df_long %>% group_by(no) %>% 
      # here we filter all groups `no` which only have one value in `Z`:
      filter(n_distinct(Z) > 1) %>% 
      filter(c(Z[1] == first(Z), Z[-1] != first(Z))) 
    
    #> # A tibble: 9 × 5
    #> # Groups:   no [3]
    #>   no    ID    X     Y     Z    
    #>   <chr> <chr> <chr> <chr> <chr>
    #> 1 V1    A     1     5.1   Tom  
    #> 2 V3    C     0     3     Bill 
    #> 3 V4    N     4     2     Tom  
    #> 4 V1    I     3     3.2   Bill 
    #> 5 V4    C     0     3     Bill 
    #> 6 V3    N     4     2.1   Tom  
    #> 7 V4    B     2     4.2   Bill 
    #> 8 V3    G     5     6.1   Tom  
    #> 9 V4    A     1     5.1   Bill
    

    Data from OP

    V1_ID <- c('A','B','I','N','G') 
    V1_X <- c(1,2,3,4,5) 
    V1_Y <- c(5.1,4.2,3.2,2.1,6.1) 
    V1_Z <- c('Tom','Tom','Bill','Tom','Tom')
    
    V2_ID <- c('B','D','E','F','G') 
    V2_X <- c(2,5,6,7,5) 
    V2_Y <- c(4.2,2,1,9,6.1) 
    V2_Z <- c('Tom','Tom','Tom','Tom','Tom')
    
    V3_ID <- c('C','B','A','N','G') 
    V3_X <- c(0,2,1,4,5) 
    V3_Y <- c(3,4.2,5.1,2.1,6.1) 
    V3_Z <- c('Bill','Bill','Bill','Tom','Tom')
    
    V4_ID <- c('N','G','C','B','A') 
    V4_X <- c(4,5,0,2,1) 
    V4_Y <- c(2,6.1,3,4.2,5.1) 
    V4_Z <- c('Tom','Tom','Bill','Bill','Bill')
    
    df <-data.frame(V1_ID,V1_X,V1_Y,V1_Z,V2_ID,V2_X,V2_Y,V2_Z,V3_ID,V3_X,V3_Y,V3_Z,V4_ID,V4_X,V4_Y,V4_Z)
    

    Created on 2023-02-21 by the reprex package (v2.0.1)