Search code examples
rdplyrduplicatesdata-cleaning

Retaining one duplicated row


I have a dataset with duplicated rows in the columns cell line, pathway, drug but the activity column has a different output. For example in the first two rows of the data frame below everything is the same from cell, drug to pathway except the activity, the first row has RESISTANT in the activity column and the second row has SENSITIVE in the activity column. I want to retain the second row, which has an output of SENSITIVE in the activity.

Could you please assist on how I can go about to do that. I want to do this for all rows in the data frame that have an output like that, I want to retain the second duplicated row.

**cell**  **drug**         **pathway**   **activity**
AU656     5-FLORO          OTHER          RESISTANT
AU656     5-FLORO          OTHER          SENSITIVE
AU656     ALISERTIB        MITOSIS        INTERMEDIATE
AU656     ALISERTIB        MITOSIS        RESISTANT
AU656     AFITINIB         EGFR           SENSITIVE
AU656     AZD6482          PI3K           INTERMEDIATE
AU656     DORAMAPIMOD      JNK            INTERMEDIATE
AU656     DORAMAPIMOD      JNK            SENSITIVE

Solution

  • We group by cell, drug, pathway and slice the 2nd row (if exists) by taking the minimum of 2 and group size (n()), thus for group sizes of 1, it returns the first row

    library(dplyr)
    df1 %>%
       group_by(cell, drug, pathway) %>%
       slice(min(2, n())) %>%
       ungroup
    

    -output

    # A tibble: 5 × 4
      cell  drug        pathway activity    
      <chr> <chr>       <chr>   <chr>       
    1 AU656 5-FLORO     OTHER   SENSITIVE   
    2 AU656 AFITINIB    EGFR    SENSITIVE   
    3 AU656 ALISERTIB   MITOSIS RESISTANT   
    4 AU656 AZD6482     PI3K    INTERMEDIATE
    5 AU656 DORAMAPIMOD JNK     SENSITIVE   
    

    data

    df1 <- structure(list(cell = c("AU656", "AU656", "AU656", "AU656", "AU656", 
    "AU656", "AU656", "AU656"), drug = c("5-FLORO", "5-FLORO", "ALISERTIB", 
    "ALISERTIB", "AFITINIB", "AZD6482", "DORAMAPIMOD", "DORAMAPIMOD"
    ), pathway = c("OTHER", "OTHER", "MITOSIS", "MITOSIS", "EGFR", 
    "PI3K", "JNK", "JNK"), activity = c("RESISTANT", "SENSITIVE", 
    "INTERMEDIATE", "RESISTANT", "SENSITIVE", "INTERMEDIATE", "INTERMEDIATE", 
    "SENSITIVE")), class = "data.frame", row.names = c(NA, -8L))