How to remove everything from a string except specific words in R?

After researching for a while and trying sub and gsub, I couldn't get exactly the result I want.

Input:

structure(list(submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07", 
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07", 
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-A1-A0SE-01A-11R-A084-07", 
"TCGA-BH-A1FC-11A-32R-A13Q-07", "TCGA-OL-A5D6-01A-21R-A27Q-07", 
"TCGA-E2-A1IK-01A-11R-A144-07", "TCGA-AC-A2FM-11B-32R-A19W-07", 
"TCGA-AN-A0FT-01A-11R-A034-07"), sample_type = c("Primary Tumor", 
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Metastatic", 
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Solid Tissue Normal", 
"Primary Tumor")), row.names = c(NA, 10L), class = "data.frame")

What I'd like to do is keep only "Tumor" or "Normal" from each string in the 'sample_type' column, if present, and remove everything else. I would also like to keep only the rows whose 'sample_type' contains "Tumor" or "Normal".

Expected output:

structure(list(submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07", 
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07", 
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-BH-A1FC-11A-32R-A13Q-07", 
"TCGA-OL-A5D6-01A-21R-A27Q-07", "TCGA-E2-A1IK-01A-11R-A144-07", 
"TCGA-AC-A2FM-11B-32R-A19W-07", "TCGA-AN-A0FT-01A-11R-A034-07"
), sample_type = c("Tumor", "Normal", "Tumor", "Tumor", "Normal", 
"Tumor", "Tumor", "Normal", "Tumor")), row.names = c(NA, 9L), class = "data.frame")

Thank you

I tried gsub, sub and substr, but none of them worked because the strings vary in length.

Solution

library(tidyverse)

df <- structure(list(submitter_id = c(
  "TCGA-B6-A0RH-01A-21R-A115-07",
  "TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07",
  "TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-A1-A0SE-01A-11R-A084-07",
  "TCGA-BH-A1FC-11A-32R-A13Q-07", "TCGA-OL-A5D6-01A-21R-A27Q-07",
  "TCGA-E2-A1IK-01A-11R-A144-07", "TCGA-AC-A2FM-11B-32R-A19W-07",
  "TCGA-AN-A0FT-01A-11R-A034-07"
), sample_type = c(
  "Primary Tumor",
  "Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Metastatic",
  "Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Solid Tissue Normal",
  "Primary Tumor"
)), row.names = c(NA, 10L), class = "data.frame")

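# Extract the first "Tumor" or "Normal" from each value (NA if neither is present),
# then drop the rows where nothing matched.
df |>
  mutate(sample_type = str_extract(sample_type, "Tumor|Normal")) |>
  drop_na(sample_type)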
#>                   submitter_id sample_type
#> 1 TCGA-B6-A0RH-01A-21R-A115-07       Tumor
#> 2 TCGA-BH-A1FU-11A-23R-A14D-07      Normal
#> 3 TCGA-BH-A1FU-01A-11R-A14D-07       Tumor
#> 4 TCGA-AR-A0TX-01A-11R-A084-07       Tumor
#> 5 TCGA-BH-A1FC-11A-32R-A13Q-07      Normal
#> 6 TCGA-OL-A5D6-01A-21R-A27Q-07       Tumor
#> 7 TCGA-E2-A1IK-01A-11R-A144-07       Tumor
#> 8 TCGA-AC-A2FM-11B-32R-A19W-07      Normal
#> 9 TCGA-AN-A0FT-01A-11R-A034-07       Tumor

Created on 2024-04-13 with reprex v2.1.0
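
If you would rather not load the tidyverse, the same idea can probably be expressed in base R. This is only a sketch, not part of the original answer: it uses a shortened four-row copy of the question's data, and the names `pattern`, `keep` and `res` are made up for illustration. Because the regular expression matches the word itself, the varying string length that tripped up substr does not matter here.

```r
# Shortened copy of the question's data (four rows, for illustration only).
df <- data.frame(
  submitter_id = c(
    "TCGA-B6-A0RH-01A-21R-A115-07", "TCGA-BH-A1FU-11A-23R-A14D-07",
    "TCGA-A1-A0SE-01A-11R-A084-07", "TCGA-AN-A0FT-01A-11R-A034-07"
  ),
  sample_type = c("Primary Tumor", "Solid Tissue Normal", "Metastatic", "Primary Tumor"),
  stringsAsFactors = FALSE
)

# Hypothetical helper name: the words we want to keep, as one regular expression.
pattern <- "Tumor|Normal"

# Keep only the rows whose sample_type contains one of the two words.
keep <- grepl(pattern, df$sample_type)
res <- df[keep, , drop = FALSE]

# regexpr() locates the first match in each string;
# regmatches() returns the matched text itself.
res$sample_type <- regmatches(res$sample_type, regexpr(pattern, res$sample_type))

# Renumber the rows so they run 1..n again.
rownames(res) <- NULL
res
```

Filtering first and extracting second matters in this version: regmatches() silently skips strings without a match, so assigning its result to the unfiltered column would fail with a length mismatch.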