I regularly work with large datasets which are awkwardly labelled with a column heading and then a subheading in the second row. When imported into R, this means most of the data is converted to character automatically, but this isn't the issue, as I am able to correct this later. The issue is trying to use the subheadings in the first row to select certain columns for further analysis.
Here is some example data:
df <- data.frame(col1 = c("10='strongly agree' 0='strongly disagree'", "3", "5"),
col2 = c("5='far too much' 3='just right' 1='far too little'", "2", "1"),
col3 = c("5='far too thick' 3='just right' 1='far too thin'", "4", "5"),
col4 = c("10='strongly agree' 0='strongly disagree'", "8", "7"),
col5 = c("1='Yes' 2='No'", "1", "1"))
I want to be able to select columns based on (sub)strings/patterns found in the first row. The actual data has hundreds of columns, so it would be much more efficient to be able to do this in a few simple lines of code rather than manually selecting the columns.
A couple of possible selections I might need to make using the example data:
In the case of the above, the outputs would end up the same as the following:
df_1 <- data.frame(col1 = c("10='strongly agree' 0='strongly disagree'", "3", "5"),
col4 = c("10='strongly agree' 0='strongly disagree'", "8", "7"))
df_2 <- data.frame(col2 = c("5='far too much' 3='just right' 1='far too little'", "2", "1"),
col3 = c("5='far too thick' 3='just right' 1='far too thin'", "4", "5"),
col5 = c("1='Yes' 2='No'", "1", "1"))
It would also be useful to know how to do this to check anywhere in a column rather than just for the first row, but this isn't important.
Thanks in advance!
We can create custom tidyselect functions and use them inside dplyr::select(where(...))
:
library(dplyr)
is_10_to_0 <- function(x) {
dplyr::first(x) == "10='strongly agree' 0='strongly disagree'"
}
is_5_3_1_or_Yes_No <- function(x) {
grepl("(.*5.*3.*1.*)|(.*Yes.*No.*)", dplyr::first(x))
}
library(dplyr)
df %>%
select(where(is_10_to_0))
#> col1
#> 1 10='strongly agree' 0='strongly disagree'
#> 2 3
#> 3 5
#> col4
#> 1 10='strongly agree' 0='strongly disagree'
#> 2 8
#> 3 7
df %>%
select(where(is_5_3_1_or_Yes_No))
#> col2
#> 1 5='far too much' 3='just right' 1='far too little'
#> 2 2
#> 3 1
#> col3 col5
#> 1 5='far too thick' 3='just right' 1='far too thin' 1='Yes' 2='No'
#> 2 4 1
#> 3 5 1
Data from the OP
df <- data.frame(col1 = c("10='strongly agree' 0='strongly disagree'", "3", "5"),
col2 = c("5='far too much' 3='just right' 1='far too little'", "2", "1"),
col3 = c("5='far too thick' 3='just right' 1='far too thin'", "4", "5"),
col4 = c("10='strongly agree' 0='strongly disagree'", "8", "7"),
col5 = c("1='Yes' 2='No'", "1", "1"))
If you want to check if the string is present in any()
of the rows we can rewrite the functions as follows:
is_10_to_0 <- function(x) {
any(grepl("(.*5.*3.*1.*)|(.*Yes.*No.*)", x))
}
Created on 2023-02-20 by the reprex package (v2.0.1)