I am trying to create a PCA plot, so I want to regroup my columns by batch (so that I can use my column names as factors). I have read these two (one, two) questions and have tried what they suggested, but it has not worked correctly (or I'm doing something wrong).
What I have is a dataframe with a few thousand columns with sample names like:
Measure Br_LV_05_BC1_1_POS Br_Lv_05_BC1_2_POS Br_Lv_05_BC1_3_POS Br_Lv_05_LR_1_POS Br_Lv_05_LR_2_POS
500 3000 8000 5000 1000 2000
600 4000 4000 4000 8000 8000
700 5000 6000 4000 9000 8000
800 6000 7000 8000 2000 1000
What I would like to do is find all columns whose names contain the string "BC1" and rename them to BC1, and do the same for "LR". This way R can use these column names as factors for the PCA instead of treating each column as an individual sample.
Measure BC1 BC1 BC1 LR LR
500 3000 8000 5000 1000 2000
600 4000 4000 4000 8000 8000
700 5000 6000 4000 9000 8000
800 6000 7000 8000 2000 1000
That way I can transpose the data (if needed) and cluster my PCA with the samples as factors. I hope I am correct in my thinking. Thank you kindly for your help.
Here is a base R option with sub, where we extract the 4th underscore-separated word from each column name and use it as the new name:
names(df1)[-1] <- sub("^([^_]+_){3}([^_]+)_.*", "\\2", names(df1)[-1])
names(df1)[-1]
#[1] "BC1" "BC1" "BC1" "LR" "LR"
Another option is to strsplit at _ and extract the 4th element:
names(df1)[-1] <- sapply(strsplit(names(df1)[-1], "_"), `[`, 4)
We can also use word from stringr:
library(stringr)
names(df1)[-1] <- word(names(df1)[-1], 4, sep="_")
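NOTE: It is better not to have duplicate column names. Assigning through names<- keeps the duplicates, but data.frame (with its default check.names = TRUE) would change them anyway by making them unique, as make.unique does ("BC1", "BC1.1", "BC1.2", ...).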
df1 <- structure(list(Measure = c(500L, 600L, 700L, 800L), Br_LV_05_BC1_1_POS = c(3000L,
4000L, 5000L, 6000L), Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L,
7000L), Br_Lv_05_BC1_3_POS = c(5000L, 4000L, 4000L, 8000L), Br_Lv_05_LR_1_POS = c(1000L,
8000L, 9000L, 2000L), Br_Lv_05_LR_2_POS = c(2000L, 8000L, 8000L,
1000L)), class = "data.frame", row.names = c(NA, -4L))
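Given the note about duplicate names, one alternative is to leave the column names unique and keep the batch labels in a separate factor. The following is only a minimal sketch of that idea, not part of the answer above: it assumes the batch is always the 4th underscore-separated field, and the object names samples, batch, mat and pca, as well as the choice to scale the variables, are illustrative assumptions.

```r
# Example data from the post; the original (unique) column names are kept
df1 <- data.frame(
  Measure = c(500L, 600L, 700L, 800L),
  Br_LV_05_BC1_1_POS = c(3000L, 4000L, 5000L, 6000L),
  Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L, 7000L),
  Br_Lv_05_BC1_3_POS = c(5000L, 4000L, 4000L, 8000L),
  Br_Lv_05_LR_1_POS  = c(1000L, 8000L, 9000L, 2000L),
  Br_Lv_05_LR_2_POS  = c(2000L, 8000L, 8000L, 1000L)
)

samples <- names(df1)[-1]

# Assumption: the 4th underscore-separated field is the batch label
batch <- factor(sapply(strsplit(samples, "_"), `[`, 4))

# Transpose so that rows are samples and columns are measures
mat <- t(as.matrix(df1[-1]))
colnames(mat) <- df1$Measure

# PCA on the samples; scaling the variables is a choice, not a requirement
pca <- prcomp(mat, scale. = TRUE)

# Plot the first two components, coloured by batch
plot(pca$x[, 1:2], col = as.integer(batch), pch = 19)
legend("topright", legend = levels(batch),
       col = seq_along(levels(batch)), pch = 19)
```

Here batch lines up with the rows of mat, so it can be passed to any plotting function as the grouping variable while every sample keeps its own identifier.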