How to delete all strings except some specific name in R?


After researching for a while and trying sub and gsub, I couldn't find a way to do exactly what I want.

Input:

structure(list(submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07", 
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07", 
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-A1-A0SE-01A-11R-A084-07", 
"TCGA-BH-A1FC-11A-32R-A13Q-07", "TCGA-OL-A5D6-01A-21R-A27Q-07", 
"TCGA-E2-A1IK-01A-11R-A144-07", "TCGA-AC-A2FM-11B-32R-A19W-07", 
"TCGA-AN-A0FT-01A-11R-A034-07"), sample_type = c("Primary Tumor", 
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Metastatic", 
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Solid Tissue Normal", 
"Primary Tumor")), row.names = c(NA, 10L), class = "data.frame")

What I'd like to do is keep only "Tumor" or "Normal" from each string in the 'sample_type' column, if present, and remove everything else. I would also like to keep a row only if its value contains "Tumor" or "Normal".

Expected output:

structure(list(submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07", 
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07", 
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-BH-A1FC-11A-32R-A13Q-07", 
"TCGA-OL-A5D6-01A-21R-A27Q-07", "TCGA-E2-A1IK-01A-11R-A144-07", 
"TCGA-AC-A2FM-11B-32R-A19W-07", "TCGA-AN-A0FT-01A-11R-A034-07"
), sample_type = c("Tumor", "Normal", "Tumor", "Tumor", "Normal", 
"Tumor", "Tumor", "Normal", "Tumor")), row.names = c(NA, 9L), class = "data.frame")

Thank you

I tried gsub, sub, and substr, but none of them worked because the strings vary in length.


Solution

  • library(tidyverse)
    
    df <- structure(list(submitter_id = c(
      "TCGA-B6-A0RH-01A-21R-A115-07",
      "TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07",
      "TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-A1-A0SE-01A-11R-A084-07",
      "TCGA-BH-A1FC-11A-32R-A13Q-07", "TCGA-OL-A5D6-01A-21R-A27Q-07",
      "TCGA-E2-A1IK-01A-11R-A144-07", "TCGA-AC-A2FM-11B-32R-A19W-07",
      "TCGA-AN-A0FT-01A-11R-A034-07"
    ), sample_type = c(
      "Primary Tumor",
      "Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Metastatic",
      "Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Solid Tissue Normal",
      "Primary Tumor"
    )), row.names = c(NA, 10L), class = "data.frame")
    
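    df |>
      # str_extract() returns the first match of the pattern, or NA if there is none
      mutate(sample_type = str_extract(sample_type, "Tumor|Normal")) |>
      # drop the rows where neither "Tumor" nor "Normal" was found
      drop_na(sample_type)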
    #>                   submitter_id sample_type
    #> 1 TCGA-B6-A0RH-01A-21R-A115-07       Tumor
    #> 2 TCGA-BH-A1FU-11A-23R-A14D-07      Normal
    #> 3 TCGA-BH-A1FU-01A-11R-A14D-07       Tumor
    #> 4 TCGA-AR-A0TX-01A-11R-A084-07       Tumor
    #> 5 TCGA-BH-A1FC-11A-32R-A13Q-07      Normal
    #> 6 TCGA-OL-A5D6-01A-21R-A27Q-07       Tumor
    #> 7 TCGA-E2-A1IK-01A-11R-A144-07       Tumor
    #> 8 TCGA-AC-A2FM-11B-32R-A19W-07      Normal
    #> 9 TCGA-AN-A0FT-01A-11R-A034-07       Tumor
    

    Created on 2024-04-13 with reprex v2.1.0
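
Since the question mentions trying sub and gsub, here is a rough base R sketch of the same idea without the tidyverse. It is not part of the original answer. It assumes a shortened three-row version of the input, and the names `m`, `keep`, and `result` are only illustrative.

```r
# Shortened, illustrative version of the input data (three rows only)
df <- data.frame(
  submitter_id = c(
    "TCGA-B6-A0RH-01A-21R-A115-07",
    "TCGA-A1-A0SE-01A-11R-A084-07",
    "TCGA-BH-A1FU-11A-23R-A14D-07"
  ),
  sample_type = c("Primary Tumor", "Metastatic", "Solid Tissue Normal"),
  stringsAsFactors = FALSE
)

# regexpr() gives the position of the first match, or -1 when there is none
m <- regexpr("Tumor|Normal", df$sample_type)
keep <- m != -1

# Keep only the matching rows, then replace the column with the matched text;
# regmatches() returns one element per successful match, in row order
result <- df[keep, , drop = FALSE]
result$sample_type <- regmatches(df$sample_type, m)
rownames(result) <- NULL

print(result)
```

Because the pattern matches by content rather than by position, the varying string lengths that made substr awkward are not an issue here.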