
Subset data based on partial match of column names


I need to subset a data frame to keep certain columns. For some of these I know the full column name, and the following works fine:

testData[,c("FullColName1","FullColName2","FullColName3")]

My problem is that I also need to include columns whose names contain specific strings, i.e. columns matched only partially by name. These strings include letters and symbols:

"PartString1()","PartString2()"

I tried putting "*" wildcards around these:

testData[ ,c("FullColName1","FullColName2","FullColName3",
             "*PartString1()*","*PartString2()*")]

But I'm getting an error message: "undefined columns selected". I can't figure out if or how I need grep() to make this work.


Solution

  • You mentioned that the strings include symbols, so for this particular example we can use [[:punct:]] as the regular expression. grepl() then returns TRUE for every column name that contains a punctuation character, and that logical vector selects the matching columns.

    d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
    names(d) <- c("FullColName1", "FullColName2", "FullColName3",
                  "PartString1()","PartString2()")
    
    d[grepl("[[:punct:]]", names(d))]
    #   PartString1() PartString2()
    # 1            13             1
    # 2            12             1
    # 3            11             1
    

    The same selection can also be written with str_detect() from the stringr package:

    library(stringr)
    d[str_detect(names(d), "[[:punct:]]")]
    #   PartString1() PartString2()
    # 1            13             1
    # 2            12             1
    # 3            11             1
    

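    Added in response to the OP's comment:

    d[grepl("ring[12]\\(\\)", names(d))]
    

    This selects the columns whose names contain either of the substrings ring1() or ring2(). The parentheses are escaped so that they match literally instead of acting as regex grouping.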
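
The question also asks how to combine exact column names with partial matches in one subset. A minimal sketch of one way to do that, assuming the partial strings should be matched literally (fixed = TRUE, so "(" and ")" need no escaping). The column names "xPartString1()y" and "Other" and the variables full, parts, partial_hit and keep are made up for illustration:

```r
# Example data; names are assigned afterwards so "()" is kept as-is
d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3), 4:6)
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
              "xPartString1()y", "PartString2()", "Other")

full  <- c("FullColName1", "FullColName2")    # exact column names
parts <- c("PartString1()", "PartString2()")  # literal substrings

# One logical vector per substring, combined with OR
partial_hit <- Reduce(`|`, lapply(parts, grepl, x = names(d), fixed = TRUE))

# Keep a column if its name matches exactly or contains a substring
keep   <- names(d) %in% full | partial_hit
result <- d[keep]
names(result)
```

Because the selection is a logical vector over names(d), the columns come back in their original order, and a string that matches nothing is silently ignored instead of raising "undefined columns selected".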