r regex

Similar string patterns are matched as one and column names are changed incorrectly, even with a word-boundary approach


I have a data frame with a specific pattern in its column names:

wssgroup_1_norm
wxsgroup_2_norm
wargroup_3_norm
wetgroup_10_norm
wegroup_11_norm

I used an approach that recognizes the group_* string (*: any number from 1 to 11) and then replaces it with another string.

IMPORTANT: please notice that group_* is bound to the preceding text without an underscore in between, like this: wssgroup_1.

Here is the code:

# Mapping of group names to replacement strings
group_names_l <- list(
  group_1 = "_IgG1", group_2 = "_IgG2", group_3 = "_IgG3", group_4 = "_IgG4",
  group_5 = "_IgA1", group_6 = "_IgM", group_7 = "_FcyR2", group_8 = "_FcyR2b",
  group_9 = "_FcyR3av", group_10 = "_FcyR3b", group_11 = "_C1q"
)

# Replace column names using the group_names_l mapping
colnames(luminex_data_o_v2) <- sapply(colnames(luminex_data_o_v2), function(col) 
{
  for (key in names(group_names_l)) {
    if (grepl(key, col)) {
      return(gsub(key, group_names_l[[key]], col))
    }
  }
  return(col)
})

This approach only partially works: groups 1, 10 and 11 are all recognized as group_1 and replaced with _IgG1. As a result, the columns for groups 10 and 11 in my new df have incorrect names.

Then I received some advice (from StackOverflow) to replace the string using word boundaries. One approach was this:

gsub(paste0("^", key,"$"), group_names_l[[key]], col)

or this:

gsub(paste0("\\b", key,"\\b"), group_names_l[[key]], col)

However, these two new approaches do nothing to my df: they no longer recognize anything in my column names.

Questions:

  • What can I do to avoid this confusion?
  • Why are word boundaries not working properly?
  • What am I doing wrong?

Solution

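  • It is clear what is going on when you actually look at the names of the group_names_l list next to your column names:

    • Each key starts with a word char (g) that is attached on the left to another word char in your data, so \b can never match there, and ^ only matches at the very start of the column name. No word boundary check is necessary on the left side anyway.
    • Each key also ends with a digit, and in your data it is followed by either another digit (group_1 inside group_10) or an underscore. Both are word chars, so \b fails on the right as well. What you want instead is a digit boundary on the right side.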
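
    The two points above can be checked directly on a few of the column names from the question. This is only a small sketch; the variable names (cols, substring_hits, boundary_hits, anchored_hits) are made up for illustration:

    ```r
    # Three of the column names from the question
    cols <- c("wssgroup_1_norm", "wetgroup_10_norm", "wegroup_11_norm")

    # Plain substring match: "group_1" is also found inside group_10 and group_11
    substring_hits <- grepl("group_1", cols)

    # Word boundaries: letters, digits and "_" are all word chars,
    # so there is no boundary on either side of the key
    boundary_hits <- grepl("\\bgroup_1\\b", cols)

    # Anchors: the whole column name would have to be exactly "group_1"
    anchored_hits <- grepl("^group_1$", cols)

    print(substring_hits)  # TRUE TRUE TRUE
    print(boundary_hits)   # FALSE FALSE FALSE
    print(anchored_hits)   # FALSE FALSE FALSE
    ```

    
    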

    In the end, all you want is to check if there is no digit on the right:

    colnames(df) <- sapply(colnames(df), function(col) 
    {
      for (key in names(group_names_l)) {
        if (grepl(paste0(key, "(?!\\d)"), col, perl=TRUE)) {
          return(sub(paste0(key, "(?!\\d)"), group_names_l[[key]], col, perl=TRUE))
        }
      }
      return(col)
    })
    

    I added perl=TRUE to both the grepl and sub calls since the (?!\d) lookahead requires the PCRE regex engine.

    I replaced gsub with sub since you expect a single replacement in the string anyway.
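
    As a possible alternative (not part of the approach above), you could extract the full group_<number> token first and look it up in the list, which avoids looping over the keys altogether. A minimal sketch, assuming each column name contains at most one such token; rename_groups is a hypothetical helper name:

    ```r
    # Mapping copied from the question
    group_names_l <- list(
      group_1 = "_IgG1", group_2 = "_IgG2", group_3 = "_IgG3", group_4 = "_IgG4",
      group_5 = "_IgA1", group_6 = "_IgM", group_7 = "_FcyR2", group_8 = "_FcyR2b",
      group_9 = "_FcyR3av", group_10 = "_FcyR3b", group_11 = "_C1q"
    )

    # Hypothetical helper: replace the first group_<digits> token in each name
    rename_groups <- function(cols, mapping) {
      vapply(cols, function(col) {
        # [0-9]+ is greedy, so "group_10" is captured whole, never as "group_1"
        m <- regexpr("group_[0-9]+", col)
        if (as.integer(m) == -1L) return(col)        # no token in this name
        key <- regmatches(col, m)
        if (!key %in% names(mapping)) return(col)    # token not in the mapping
        sub(key, mapping[[key]], col, fixed = TRUE)
      }, character(1), USE.NAMES = FALSE)
    }

    new_names <- rename_groups(
      c("wssgroup_1_norm", "wetgroup_10_norm", "wegroup_11_norm", "sample_id"),
      group_names_l
    )
    print(new_names)
    ```

    With the example names this should give wss_IgG1_norm, wet_FcyR3b_norm and we_C1q_norm, and leave sample_id untouched.
    
    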