Search code examples
rreadr

Read columns with readr using regular expressions


I need to import data files with various column numbers. Finally, the code should be used by other co-workers being not very familiar with R. So it should be robust and without warning messages preferably. The main problem is that the headder is always ending with an additional "," which does not appear in the data below. Beside a whole bunch of unused columns the required columns are always labeled in the same way. I.e. there is always a certain string within the column name, but not necessarily the whole column name is identical.

The example code is a very simple approximation to my files. First, I would like to get rid of the error message due to the errornous comma at the end of the headder. Something like skip_col = ncol(headder). And second, I would like to read only columns with "*des*" within the column name.

My approach to deal with it looks simple in this simplified example but is not very satisfying in my more complex code.

library(tidyverse)

read_csv("date,col1des,col1foo,col2des,col3des,col2foo,col3foo, 
          2015-10-23T22:00:00Z,0.6,-1.5,-1.3,-0.5,1.8,0 
          2015-10-23T22:10:00Z,-0.5,-0.6,1.5,0.1,-0.3,0.3
          2015-10-23T22:20:00Z,0.1,0.2,-1.6,-0.1,-1.4,-0.4
          2015-10-23T22:30:00Z,1.7,-1.2,-0.2,-0.4,0.3,0.3")

if (length(grep("des", names(data))) > 0) {
  des <- data[grep("des", names(data))]
  des <- bind_cols(date = data$date, des)
}

So in my full code, I get the following warning messages:

1. Missing column names filled in: 'X184' [184] 
2. Duplicated column names deduplicated: [long list of unrequired columns with dublicated names]

I would appreciate a solution within the tidyverse. As far as I found it is not possible to use regular expressions directly within the read_csv call to specify column names, right? So perhaps the only way is to read the headder first and build the cols() call out of that. But this exceeds my R knowledge.

Edit:

I wonder if something like this is possible:

headline <- "date,col1des,col1foo,col2des,col3des,col2foo,col3foo,"
head     <- headline %>% strsplit(",") %>% unlist(use.names = FALSE)
head_des <- head[grep("des", head)]
data     <- read_csv("mydata.csv", col_types = cols_only(head_des[1] = "d", head_des[2] = "d"))

I would like to grep() the column names befor reading the whole data.


Solution

  • Edit no. 2

    In reaction to your comment; This works with your data-string:

    library(tidyverse)
    yourData <- "date,col1des,col1foo,col2des,col3des,col2foo,col3foo, 
              2015-10-23T22:00:00Z,0.6,-1.5,-1.3,-0.5,1.8,0 
              2015-10-23T22:10:00Z,-0.5,-0.6,1.5,0.1,-0.3,0.3
              2015-10-23T22:20:00Z,0.1,0.2,-1.6,-0.1,-1.4,-0.4
              2015-10-23T22:30:00Z,1.7,-1.2,-0.2,-0.4,0.3,0.3"
    
    data <- suppressWarnings(read_csv(yourData))
    
    header <- names(data)
    colList <- ifelse(str_detect(header,'des'),'c','_') %>% as.list
    suppressWarnings(read_csv(yourData,col_types = do.call(cols_only, colList)))
    #> # A tibble: 4 x 3
    #>   col1des col2des col3des
    #>   <chr>   <chr>   <chr>  
    #> 1 0.6     -1.3    -0.5   
    #> 2 -0.5    1.5     0.1    
    #> 3 0.1     -1.6    -0.1   
    #> 4 1.7     -0.2    -0.4
    

    EDIT

    Trying to accomodate your edited wishes and with the help of this Post:

    
    library(tidyverse)
    
    
    header <- suppressWarnings(readLines('file.csv')[1]) %>% 
      str_split(',',simplify = T)
    colList <- ifelse(str_detect(header,'des'),'c','_') %>% as.list
    suppressWarnings(read_csv(file = 'file.csv',col_types = do.call(cols_only, colList)))
    #> # A tibble: 4 x 3
    #>   col1des col2des col3des
    #>   <chr>   <chr>   <chr>  
    #> 1 0.6     -1.3    -0.5   
    #> 2 -0.5    1.5     0.1    
    #> 3 0.1     -1.6    -0.1   
    #> 4 1.7     -0.2    -0.4
    

    This is the most robust, most tidyverse way, I could come up with:

    library(tidyverse)
    
    file <- suppressWarnings(readLines('file.csv')) %>% 
      str_split(',') 
    dims <- file %>% map_int(~length(.))
    if(any(dims != median(dims))){
      file[[which(dims != median(dims))]] <- file[[which(dims != median(dims))]][1:median(dims)]
    }
    data <- file %>% map_chr(~paste(.,collapse = ',')) %>% 
      paste(., sep = '\n') %>% read_csv
    (data <- data %>% select(which(str_detect(names(data), pattern = 'des'))))
    #> # A tibble: 4 x 3
    #>   col1des col2des col3des
    #>     <dbl>   <dbl>   <dbl>
    #> 1     0.6    -1.3    -0.5
    #> 2    -0.5     1.5     0.1
    #> 3     0.1    -1.6    -0.1
    #> 4     1.7    -0.2    -0.4
    

    Where file.csv contains your data.