Search code examples
rstringdataframesubstring

R - Select columns of dataframe based on substrings/patterns in values


I regularly work with large datasets which are awkwardly labelled with a column heading and then a subheading in the second row. When imported into R, this means most of the data is converted to character automatically, but this isn't the issue, as I am able to correct this later. The issue is trying to use the subheadings in the first row to select certain columns for further analysis.

Here is some example data:

df <- data.frame(col1 = c("10='strongly agree' 0='strongly disagree'", "3", "5"),
                 col2 = c("5='far too much' 3='just right' 1='far too little'", "2", "1"),
                 col3 = c("5='far too thick' 3='just right' 1='far too thin'", "4", "5"),
                 col4 = c("10='strongly agree' 0='strongly disagree'", "8", "7"),
                 col5 = c("1='Yes' 2='No'", "1", "1"))

I want to be able to select columns based on (sub)strings/patterns found in the first row. The actual data has hundreds of columns, so it would be much more efficient to be able to do this in a few simple lines of code rather than manually selecting the columns.

A couple of possible selections I might need to make using the example data:

  1. select columns which have exact match "10='strongly agree' 0='strongly disagree'" for the first row
  2. select columns which have "1", "3" and "5" OR have both "'Yes'" and "'No'" somewhere in the string in the first row

In the case of the above, the outputs would end up the same as the following:

df_1 <- data.frame(col1 = c("10='strongly agree' 0='strongly disagree'", "3", "5"),
                 col4 = c("10='strongly agree' 0='strongly disagree'", "8", "7"))

df_2 <- data.frame(col2 = c("5='far too much' 3='just right' 1='far too little'", "2", "1"),
                 col3 = c("5='far too thick' 3='just right' 1='far too thin'", "4", "5"),
                 col5 = c("1='Yes' 2='No'", "1", "1"))

It would also be useful to know how to do this to check anywhere in a column rather than just for the first row, but this isn't important.

Thanks in advance!


Solution

  • We can create custom tidyselect functions and use them inside dplyr::select(where(...)):

    library(dplyr)
    
    is_10_to_0 <- function(x) {
      dplyr::first(x) == "10='strongly agree' 0='strongly disagree'"
    }
    
    is_5_3_1_or_Yes_No <- function(x) {
      grepl("(.*5.*3.*1.*)|(.*Yes.*No.*)", dplyr::first(x))
    }
    
    library(dplyr)
    
    df %>% 
      select(where(is_10_to_0))
    #>                                        col1
    #> 1 10='strongly agree' 0='strongly disagree'
    #> 2                                         3
    #> 3                                         5
    #>                                        col4
    #> 1 10='strongly agree' 0='strongly disagree'
    #> 2                                         8
    #> 3                                         7
    
    
    df %>% 
      select(where(is_5_3_1_or_Yes_No))
    #>                                                 col2
    #> 1 5='far too much' 3='just right' 1='far too little'
    #> 2                                                  2
    #> 3                                                  1
    #>                                                col3           col5
    #> 1 5='far too thick' 3='just right' 1='far too thin' 1='Yes' 2='No'
    #> 2                                                 4              1
    #> 3                                                 5              1
    

    Data from the OP

    df <- data.frame(col1 = c("10='strongly agree' 0='strongly disagree'", "3", "5"),
                     col2 = c("5='far too much' 3='just right' 1='far too little'", "2", "1"),
                     col3 = c("5='far too thick' 3='just right' 1='far too thin'", "4", "5"),
                     col4 = c("10='strongly agree' 0='strongly disagree'", "8", "7"),
                     col5 = c("1='Yes' 2='No'", "1", "1"))
    

    If you want to check if the string is present in any() of the rows we can rewrite the functions as follows:

    is_10_to_0 <- function(x) {
      any(grepl("(.*5.*3.*1.*)|(.*Yes.*No.*)", x))
    }
    

    Created on 2023-02-20 by the reprex package (v2.0.1)