Search code examples
rfor-loopdata.tableapplysql-like

R function for matching rows in dataframe to pattern in another dataframe and adding grouping variable


I am trying to create a function that will, for every row in a given dataframe, match a pattern with %like% to the contents of a column in that row and then paste a specific category name into the column. The aim is to have a function that I can use to match a list of genes and their descriptions to types of proteins coded for in groupings.

This function should, for each row in the DATA, scan it against each row in the LIST and if the column in DATA named in LIST$column matches the LIST$pattern for that row, then the name in LIST$name should be pasted into DATA$Function.

So for this as the LIST:

testdf <- data.frame(pattern = c("vers", "set", "abc"),
                     name = c("Versicolor Iris", "Setosa Iris", "Alphabet Iris"),
                     column = "Species")

running the code on iris as the DATA should return this for the first few rows:

 Sepal.Length Sepal.Width Petal.Length Petal.Width Species Function
1          5.1         3.5          1.4         0.2  setosa  Setosa Iris
2          4.9         3.0          1.4         0.2  setosa  Setosa Iris
3          4.7         3.2          1.3         0.2  setosa  Setosa Iris
4          4.6         3.1          1.5         0.2  setosa  Setosa Iris

My original clunky code to do this manually was like this:

group_1 <- iris|> filter(Species %like% "set")
group1$Function <- "Setosa Iris"

and then using rbind at the end to recombine all the groups.

I want to be able to create a csv file with all potential grouping information (column to match, pattern to match, name of grouping variable to paste in) and run my function to compare my data to this file line by line to create a grouping variable.

The column name part in the LIST is unfortunately vital as I want the rows to be scanned in order and some may isolate a single gene by its gene ID rather than all genes of that type as per the description, and gene IDs and descriptions are in different columns.

Currently this is my function, but it does not work and uses a for loop that might end up being a bottleneck:

require(data.table)

testdf <- data.frame(pattern = c("vers", "set", "abc"),
                     name = c("Versicolor Iris", "Setosa Iris", "Alphabet Iris"),
                     column = "Species", "Species", "Petal.Width")

Function2 <- function(data, list, groupingcategoryname="Function"){
  
  groups <- data
  
  groups$Function <- "1"
  
  group_by_desc <- function(x){
    for (i in {{list}}$column) {if(groups[[i]] %like% {{list}}$pattern) {
      groups$Function <- {{list}}$name
    }else {groups$Function <- "no match"}
  }
  }
  data <- apply(X=data, 1, FUN=group_by_desc)
  
}

test <- Function2(data=iris, list=testdf)

Solution

  • I think a for loop is a fine approach here. The data.table::"%like%"() operator is just a wrapper around grepl, so I'll eliminate the dependency on data.table and use grepl directly. And then we can replace directly, defining a function inside the function doesn't seem necessary.

    Function2 <- function(data, list, groupingcategoryname="Function"){
      ## create column with default value
      data[[groupingcategoryname]] = "no match"
    
      ## loop through `list` data frame making replacements
      for(i in 1:nrow(list)) {
        data[
          ## rows
          grepl(pattern = list$pattern[i], x = data[[list$column[i]]]), 
          ## column
          groupingcategoryname
        ] = list$name[i] ## new value
      }
      data
    }
    Function2(iris, testdf)
    #     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species        Function
    # 1            5.1         3.5          1.4         0.2     setosa     Setosa Iris
    # 2            4.9         3.0          1.4         0.2     setosa     Setosa Iris
    # 3            4.7         3.2          1.3         0.2     setosa     Setosa Iris
    # 4            4.6         3.1          1.5         0.2     setosa     Setosa Iris
    # ...