
How to search column names of a data frame by a character string and replace the entire column name with a new one (for downstream PCA)


I am trying to create a PCA plot, so I want to regroup my columns by batch (so that I can use my column names as factors). I have read these two (one, two) questions and have tried what they suggested, but it has not worked correctly (or I'm doing something wrong).

What I have is a dataframe with a few thousand columns with sample names like:

Measure    Br_LV_05_BC1_1_POS  Br_Lv_05_BC1_2_POS Br_Lv_05_BC1_3_POS Br_Lv_05_LR_1_POS Br_Lv_05_LR_2_POS
500               3000                8000                5000              1000              2000
600               4000                4000                4000              8000              8000 
700               5000                6000                4000              9000              8000 
800               6000                7000                8000              2000              1000

What I would like to do is search for all columns whose names contain the string "BC1" and rename each of them to just "BC1", and do the same for "LR". This way I can have R use these column names as factors for PCA, instead of the PCA treating each column as an individual sample.

Measure  BC1    BC1     BC1     LR      LR
500      3000   8000    5000    1000    2000
600      4000   4000    4000    8000    8000 
700      5000   6000    4000    9000    8000 
800      6000   7000    8000    2000    1000

That way I can transpose the data (if needed) and cluster my PCA with the samples as factors. I hope I am correct in my thinking. Thank you kindly for your help.


Solution

  • Here is a base R option with sub, where we extract the 4th underscore-delimited field from each column name and use it as the new name

    names(df1)[-1] <-  sub("^([^_]+_){3}([^_]+)_.*", "\\2", names(df1)[-1])
    names(df1)[-1]
    #[1] "BC1" "BC1" "BC1" "LR"  "LR" 
    

    Another option is to strsplit at _ and extract the 4th element

    names(df1)[-1] <- sapply(strsplit(names(df1)[-1], "_"), `[`, 4)
    

    We can also use word from stringr

    library(stringr)
    names(df1)[-1] <- word(names(df1)[-1], 4, sep="_")
    

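    NOTE: It is better not to have duplicate column names. They would be changed anyway if the data were passed through data.frame (with its default check.names = TRUE), which makes the names unique (as make.unique does, e.g. BC1, BC1.1, BC1.2)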

    data

    df1 <- structure(list(Measure = c(500L, 600L, 700L, 800L), Br_LV_05_BC1_1_POS = c(3000L, 
    4000L, 5000L, 6000L), Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L, 
    7000L), Br_Lv_05_BC1_3_POS = c(5000L, 4000L, 4000L, 8000L), Br_Lv_05_LR_1_POS = c(1000L, 
    8000L, 9000L, 2000L), Br_Lv_05_LR_2_POS = c(2000L, 8000L, 8000L, 
    1000L)), class = "data.frame", row.names = c(NA, -4L))
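
Following the note about duplicate names, one possible alternative is to leave the column names untouched and keep the batch label in a separate factor. The sketch below is only an illustration under some assumptions: that each sample column should become one observation in the PCA, that the batch is always the 4th underscore-delimited field, and that base R prcomp is an acceptable PCA function. The object names batch, mat, pca and scores are made up for this example and do not come from the question.

```r
# Example data from the answer, rebuilt with data.frame()
df1 <- data.frame(
  Measure            = c(500L, 600L, 700L, 800L),
  Br_LV_05_BC1_1_POS = c(3000L, 4000L, 5000L, 6000L),
  Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L, 7000L),
  Br_Lv_05_BC1_3_POS = c(5000L, 4000L, 4000L, 8000L),
  Br_Lv_05_LR_1_POS  = c(1000L, 8000L, 9000L, 2000L),
  Br_Lv_05_LR_2_POS  = c(2000L, 8000L, 8000L, 1000L)
)

# Keep the original (unique) sample names and derive the batch label
# from the 4th "_"-separated field, as in the strsplit option above
samples <- names(df1)[-1]
batch   <- factor(sapply(strsplit(samples, "_"), `[`, 4))

# Transpose so that samples are rows and measures are columns
mat <- t(as.matrix(df1[-1]))
colnames(mat) <- df1$Measure

# PCA on the samples (centering and scaling are a choice, not a requirement)
pca <- prcomp(mat, center = TRUE, scale. = TRUE)

# Attach the batch factor to the first two component scores
scores <- data.frame(sample = samples, batch = batch, pca$x[, 1:2])

# Colour the points by batch
plot(scores$PC1, scores$PC2, col = as.integer(scores$batch), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(scores$batch),
       col = seq_along(levels(scores$batch)), pch = 19)
```

With this layout the data frame keeps one uniquely named column per sample, and the grouping lives in scores$batch, which can be passed to whatever plotting function is used for the PCA plot.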