Search code examples
dataframeduplicatesrowsubsetmultiple-columns

R: remove rows in data frame for which all columns contain same content or nothing


I have a data frame:

# create a data frame
V1 = c("gene_1", "gene_1", "", "")
V2 = c("gene_2", "gene_2", "", "")
V3 = c("gene_3", "gene_3", "gene_4", "")
V4 = c("gene_4", "gene_4", "", "")
V5 = c("gene_5", "gene_5", "gene_8", "")
V6 = c("gene_6", "gene_6", "gene_6", "gene_7")
df = as.data.frame(rbind(V1, V2, V3, V4, V5, V6))

The data frame df looks like this:

       V1     V2     V3     V4
V1 gene_1 gene_1              
V2 gene_2 gene_2              
V3 gene_3 gene_3 gene_4       
V4 gene_4 gene_4              
V5 gene_5 gene_5 gene_8       
V6 gene_6 gene_6 gene_6 gene_7

Now, I want to remove all the rows that have only labels of the same gene, resulting in:

       V1     V2     V3     V4            
V3 gene_3 gene_3 gene_4                  
V5 gene_5 gene_5 gene_8       
V6 gene_6 gene_6 gene_6 gene_7

I found several similar questions on stack overflow, including here but none of these solutions work for my exact issue. I feel like this should be easy, but I can't seem to find how to go about this.


Solution

  • I found a solution, based on another post that I found here:

    df[df == ''] <- NA
    df %>% filter(if_any(V2:V4, ~ .x != V1))       
    

    Gives:

           V1     V2     V3     V4
    V3 gene_3 gene_3 gene_4   <NA>
    V5 gene_5 gene_5 gene_8   <NA>
    V6 gene_6 gene_6 gene_6 gene_7