r subset matching

How would I create a subset by matching multiple patterns at a specific location in column names?


Sample data from dput:

df <- structure(
  list(
    TCGA.OR.A5JP.01A = c(0.0980697379293791, NA, NA, 0.883701102465278, 0.920634671107133),
    TCGA.OR.A5JG.01A = c(0.909142796219422, NA, NA, 0.870551482839855, 0.9170243029211),
    TCGA.PK.A5HB.01A = c(0.860316269591325, NA, NA, 0.283919878689488, 0.92350756003924),
    TCGA.OR.A5JE.01A = c(0.288860652773179, NA, NA, 0.831906751819423, 0.913890036560933),
    TCGA.OR.A5KU.01A = c(0.0897293436489091, NA, NA, 0.166760246036103, 0.920367435681197)
  ),
  row.names = c("cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"),
  class = "data.frame"
)

I want to create a subset that keeps only the columns whose names contain certain patterns at character positions 11 and 12 (counting the "."s as characters). For example, the "xx" in TCGA.OR.A5xx.01A. I have a list of multiple codes/patterns for that position (e.g., "JG", "HB", "KU").

I have tried:

df_subset <- subset(df, select=grepl("JG|HB|KU",names(df)))

but this is not position-specific, so columns whose names coincidentally contain those patterns elsewhere are also included.

I also have a second question - can I somehow do this with a list of patterns? There are over 30 patterns I put in a list and I'm wondering if I could use that list instead of typing them all out again.


Solution

  • We could use a combination of str_locate and which to select columns. If you have a vector of search terms, they can be collapsed into a single regular expression with paste0, joined by "|". We can then find where the first match occurs in each column name and select the columns where that match spans positions 11 to 12.

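    library(tidyverse)
    
    key_chr <- c("JG", "HB", "KU")
    # Collapse the codes into one alternation pattern: "JG|HB|KU"
    search_terms <- paste0(key_chr, collapse = "|")
    
    # str_locate() returns the start and end of the first match in each name
    loc <- str_locate(names(df), search_terms)
    
    df %>% 
      select(which(loc[, "start"] == 11 & loc[, "end"] == 12))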
    

    Or in base R, we could write it as:

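    # regexpr() gives the start of the first match (-1 if none);
    # drop = FALSE keeps a data frame even if only one column matches
    df_subset <- df[, which(regexpr(search_terms, names(df)) == 11), drop = FALSE]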
    

    Output

               TCGA.OR.A5JG.01A TCGA.PK.A5HB.01A TCGA.OR.A5KU.01A
    cg00000029        0.9091428        0.8603163       0.08972934
    cg00000108               NA               NA               NA
    cg00000109               NA               NA               NA
    cg00000165        0.8705515        0.2839199       0.16676025
    cg00000236        0.9170243        0.9235076       0.92036744
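
Note that both snippets above test only the first match in each name. A name that also contains one of the codes earlier in the string (for example at positions 6 and 7) would be dropped even when positions 11 and 12 match. Below is a minimal sketch of a check that looks only at the target position, assuming the codes are always two characters long and always sit at positions 11 and 12. The toy data frame and its column names are hypothetical, made up to show the edge case.

```r
# Hypothetical toy data: the 2nd name has "HB" at positions 6-7 AND "JG" at 11-12;
# the 4th name has "JG" at positions 6-7 but "JE" (not in the list) at 11-12.
toy <- data.frame(
  TCGA.OR.A5JP.01A = 1:2,
  TCGA.HB.A5JG.01A = 3:4,
  TCGA.PK.A5HB.01A = 5:6,
  TCGA.JG.A5JE.01A = 7:8
)
key_chr <- c("JG", "HB", "KU")

# Option 1: extract characters 11-12 and compare them with the codes directly.
# No regular expression is needed, so a long vector of codes can be used as-is.
keep_substr <- substr(names(toy), 11, 12) %in% key_chr

# Option 2: an anchored regular expression that skips exactly 10 characters
# from the start of the name and then requires one of the codes.
pattern <- paste0("^.{10}(", paste(key_chr, collapse = "|"), ")")
keep_regex <- grepl(pattern, names(toy))

# For comparison: the first-match test misses the 2nd column,
# because its first match starts at position 6 rather than 11.
keep_first <- regexpr(paste(key_chr, collapse = "|"), names(toy)) == 11

toy_subset <- toy[, keep_substr, drop = FALSE]
names(toy_subset)
```

On the data in the question, all of these approaches select the same three columns. They differ only when a code also appears before position 11.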