After researching for a while and trying sub and gsub, I couldn't get the result I want.
Input:
structure(list(submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07",
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07",
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-A1-A0SE-01A-11R-A084-07",
"TCGA-BH-A1FC-11A-32R-A13Q-07", "TCGA-OL-A5D6-01A-21R-A27Q-07",
"TCGA-E2-A1IK-01A-11R-A144-07", "TCGA-AC-A2FM-11B-32R-A19W-07",
"TCGA-AN-A0FT-01A-11R-A034-07"), sample_type = c("Primary Tumor",
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Metastatic",
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Solid Tissue Normal",
"Primary Tumor")), row.names = c(NA, 10L), class = "data.frame")
I'd like to keep only "Tumor" or "Normal" from each string in the 'sample_type' column, if present, and remove everything else. I would also like to keep a row only if its 'sample_type' contains "Tumor" or "Normal".
Expected output:
structure(list(submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07",
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07",
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-BH-A1FC-11A-32R-A13Q-07",
"TCGA-OL-A5D6-01A-21R-A27Q-07", "TCGA-E2-A1IK-01A-11R-A144-07",
"TCGA-AC-A2FM-11B-32R-A19W-07", "TCGA-AN-A0FT-01A-11R-A034-07"
), sample_type = c("Tumor", "Normal", "Tumor", "Tumor", "Normal",
"Tumor", "Tumor", "Normal", "Tumor")), row.names = c(NA, 9L), class = "data.frame")
Thank you
I tried gsub, sub, and substr, but none of them worked because the string lengths vary.
library(tidyverse)
df <- structure(list(submitter_id = c(
"TCGA-B6-A0RH-01A-21R-A115-07",
"TCGA-BH-A1FU-11A-23R-A14D-07", "TCGA-BH-A1FU-01A-11R-A14D-07",
"TCGA-AR-A0TX-01A-11R-A084-07", "TCGA-A1-A0SE-01A-11R-A084-07",
"TCGA-BH-A1FC-11A-32R-A13Q-07", "TCGA-OL-A5D6-01A-21R-A27Q-07",
"TCGA-E2-A1IK-01A-11R-A144-07", "TCGA-AC-A2FM-11B-32R-A19W-07",
"TCGA-AN-A0FT-01A-11R-A034-07"
), sample_type = c(
"Primary Tumor",
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Metastatic",
"Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Solid Tissue Normal",
"Primary Tumor"
)), row.names = c(NA, 10L), class = "data.frame")
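df |>
  # str_extract() returns the first match of the pattern, or NA if there is none
  mutate(sample_type = str_extract(sample_type, "Tumor|Normal")) |>
  # drop rows where neither word was found (e.g. "Metastatic")
  drop_na(sample_type)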
#>                   submitter_id sample_type
#> 1 TCGA-B6-A0RH-01A-21R-A115-07       Tumor
#> 2 TCGA-BH-A1FU-11A-23R-A14D-07      Normal
#> 3 TCGA-BH-A1FU-01A-11R-A14D-07       Tumor
#> 4 TCGA-AR-A0TX-01A-11R-A084-07       Tumor
#> 5 TCGA-BH-A1FC-11A-32R-A13Q-07      Normal
#> 6 TCGA-OL-A5D6-01A-21R-A27Q-07       Tumor
#> 7 TCGA-E2-A1IK-01A-11R-A144-07       Tumor
#> 8 TCGA-AC-A2FM-11B-32R-A19W-07      Normal
#> 9 TCGA-AN-A0FT-01A-11R-A034-07       Tumor
Created on 2024-04-13 with reprex v2.1.0
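Since the question mentions sub and gsub, here is a minimal base R sketch of the same idea, in case loading the tidyverse is not an option. It is not part of the reprex above: `df_small` is a hypothetical three-row subset of the data, and `keep` and `res` are names made up for illustration. It assumes each string contains at most one of the two words.

```r
# Hypothetical three-row subset of the question's data (an assumption for illustration)
df_small <- data.frame(
  submitter_id = c("TCGA-B6-A0RH-01A-21R-A115-07",
                   "TCGA-BH-A1FU-11A-23R-A14D-07",
                   "TCGA-A1-A0SE-01A-11R-A084-07"),
  sample_type = c("Primary Tumor", "Solid Tissue Normal", "Metastatic"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose sample_type contains "Tumor" or "Normal"
keep <- grepl("Tumor|Normal", df_small$sample_type)
res <- df_small[keep, , drop = FALSE]

# Replace the whole string with the captured word, so varying lengths do not matter
res$sample_type <- sub(".*(Tumor|Normal).*", "\\1", res$sample_type)

# Renumber the rows after filtering
rownames(res) <- NULL
res
```

The capture group with a "\\1" backreference is what makes sub work here: the pattern matches the entire string, and only the captured word is written back, so no fixed character positions (as with substr) are needed.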