Search code examples
rselectdata.tablefread

R data.table fread select with a partial string match?


I would like to use fread to pull in only columns with names that match a condition. (In this case, I would like to pull in all columns that contain the label email.) Imagine that you have this data in a file called tempdata.txt in your working directory:

col1,col2,col3,email1,email2,col4,url1,url2,col5
1,2,3,4,5,6,7,8,9
9,8,7,6,5,4,3,2,1
x,x,x,[email protected],[email protected],y,y,y,y
a,a,a,a,a,a,http://google.com,http://stackoverflow.com,a

It is possible to use fread to load a subset of the columns if you know the names:

test <- data.table::fread("tempdata.txt", select=c("email1","email2"))

> test
      email1      email2
1:         4           5
2:         6           5
3: [email protected] [email protected]
4:         a           a

Is it also possible to select using a string match? I am trying to mimic this behavior but within the fread command:

> all <- data.table::fread("tempdata.txt")
> all %>% select(contains("email"))
      email1      email2
1:         4           5
2:         6           5
3: [email protected] [email protected]
4:         a           a

Thanks for any insight.


Solution

  • I don't know that fread doesn't have that capability (though I don't see it in the docs). However, a relatively inexpensive approach would be to read the first row or two, get the column names, grep them, and move on from there.

    library(data.table)
    fwrite(data.table(a=1:2, email1=c('a','b'), snailmail=c('c','d'), email2=c('e','f')), "test.csv")
    fread("test.csv", nrows=1)
    #    a email1 snailmail email2
    # 1: 1      a         c      e
    cols <- colnames(fread("test.csv", nrows=0))
    cols
    # [1] "a"         "email1"    "snailmail" "email2"   
    fread("test.csv", select = grep("^email", cols, value = TRUE))
    #    email1 email2
    # 1:      a      e
    # 2:      b      f