Tags: r, regex, if-statement, string-parsing

Subset Columns based on partial matching of column names in the same data frame


I would like to understand how to subset multiple columns of the same data frame by comparing the first 5 letters of the column names with each other. Columns whose names share the same first 5 letters should be subset together and stored in a new variable.

Here is a small illustration of the output I need.

Let's say the data frame is eatable:

fruits_area   fruits_production   vegetables_area   vegetable_production

12            100                 26                324
33            250                 40                580
660           510                 43                581

eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
          "vegetable_production")

I was trying to write a function that loops over the column names, compares their first 5 letters, and stores each group of matching columns as a separate subset.

checkExpression <- function(dataset,str){
    dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}

checkExpression(eatable,"your_string")

The above function matches a given string correctly, but I am confused about how to match the column names in the dataset against each other.

Edit: I think regular expressions would work here.


Solution

  • You could try:

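    # Unique 5-character prefixes of the column names (substr() is 1-based)
    v <- unique(substr(names(eatable), 1, 5))
    # One data frame per prefix; startsWith() matches only at the start of a name
    lapply(v, function(x) eatable[startsWith(names(eatable), x)])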
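
To see how the steps fit together, the intermediate values can be inspected one at a time. The sketch below rebuilds the question's `eatable` data frame so that it runs on its own. The names `prefixes` and `is_fruit` are introduced here only for illustration, and the mask is built with `startsWith()`, which matches only at the beginning of a name.

```r
# Rebuild the example data frame from the question.
eatable <- data.frame(
  fruits_area          = c(12, 33, 660),
  fruits_production    = c(100, 250, 510),
  vegetables_area      = c(26, 40, 43),
  vegetable_production = c(324, 580, 581)
)

# Step 1: first five characters of every column name.
prefixes <- substr(names(eatable), 1, 5)
prefixes
# [1] "fruit" "fruit" "veget" "veget"

# Step 2: keep one copy of each prefix.
unique(prefixes)
# [1] "fruit" "veget"

# Step 3: a logical mask marks the columns that share one prefix ...
is_fruit <- startsWith(names(eatable), "fruit")
is_fruit
# [1]  TRUE  TRUE FALSE FALSE

# ... and indexing with the mask keeps only those columns.
eatable[is_fruit]
```

Wrapping step 3 in `lapply()` over the unique prefixes repeats the subsetting once per group.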
    

    Or using map() + select() (select_() is deprecated in current dplyr):

    library(tidyverse)
    map(v, ~ select(eatable, starts_with(.x)))
    

    Which gives:

    #[[1]]
    #  fruits_area fruits_production
    #1          12               100
    #2          33               250
    #3         660               510
    #
    #[[2]]
    #  vegetables_area vegetable_production
    #1              26                  324
    #2              40                  580
    #3              43                  581
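
The list above is unnamed, so each group has to be picked out by position. A small variation, sketched here on the assumption that the prefixes make acceptable element names, attaches them with `setNames()`. The name `groups` is illustrative and not part of the original answer.

```r
# Rebuild the example data frame from the question.
eatable <- data.frame(
  fruits_area          = c(12, 33, 660),
  fruits_production    = c(100, 250, 510),
  vegetables_area      = c(26, 40, 43),
  vegetable_production = c(324, 580, 581)
)

# Unique 5-character prefixes of the column names.
v <- unique(substr(names(eatable), 1, 5))

# Same subsetting as before, but the result is named by prefix.
groups <- setNames(
  lapply(v, function(x) eatable[startsWith(names(eatable), x)]),
  v
)

names(groups)
# [1] "fruit" "veget"

# Access a group by its prefix instead of by position.
groups$veget
```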
    

    Should you want to make it into a function:

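    checkExpression <- function(df, l = 5) {
      # Unique prefixes made of the first l characters of each column name
      v <- unique(substr(names(df), 1, l))
      # Return a list with one data frame per prefix
      lapply(v, function(x) df[startsWith(names(df), x)])
    }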
    

    Then simply use:

    checkExpression(eatable, 5)
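
Another base R option, offered as an alternative and not as what the answer above does, is to split the columns by prefix with `split.default()`. Called on a data frame, it treats the data frame as a list of columns and returns a list named by group. The sketch rebuilds `eatable` so that it runs on its own; `by_prefix` is an illustrative name.

```r
# Rebuild the example data frame from the question.
eatable <- data.frame(
  fruits_area          = c(12, 33, 660),
  fruits_production    = c(100, 250, 510),
  vegetables_area      = c(26, 40, 43),
  vegetable_production = c(324, 580, 581)
)

# Group the columns by the first five characters of their names.
by_prefix <- split.default(eatable, substr(names(eatable), 1, 5))

names(by_prefix)
# [1] "fruit" "veget"

# Each element is a data frame holding the columns of one group.
by_prefix$fruit
```

One difference worth knowing: the groups come back ordered by sorted prefix, not by first appearance in the data frame.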