I have a data.frame (PC) that looks like this:
https://i.sstatic.net/NWJKe.png
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
https://i.sstatic.net/vQ48u.png
I want to sort the columns (beginning with "GTEX.") in the data.frame such that they are ordered by the age indicated in the age matrix.
PC <- read.csv("protein_coding.csv")
age <- read.table("Annotations_SubjectPhenotypes_DS.txt")
I started by changing the names in the age matrix to replace the '-' by '.':
new_SUBJID <- gsub("-", ".", age$SUBJID, fixed = TRUE)
age[, "SUBJID"] <- new_SUBJID
Then, I ordered the rows of the age matrix by age:
sort.age <- with(age, age[order(AGE) , ])
sort.age <- na.omit(sort.age)
I then created a vector age.id containing the SUBJIDs in the right order (= how I want to order the columns from the PC matrix).
age.id <- sort.age$SUBJID
But then I am stuck, since the column names in the PC matrix and the SUBJIDs in the age matrix are not the same: the PC names carry an extra sample suffix. Could someone please help me?
Thank you very much in advance! Svalf
It would have been better to show the example without using an image. Suppose there are two character vectors,
str1 <- c('GTEX.N7MS.0007.SM.2D7W1', 'GTEX.PFPP.0007.SM.2D8W1', 'GTEX.N7MS.0008.SM.4E3J1')
str2 <- c('GTEX.N7MS', 'GTEX.PFPP')
representing the column names of 'PC' and the 'SUBJID' column of the 'age' dataset (after replacing the - with . and sorting by age). We remove the suffix by matching a literal . followed by 4 digits (\\.\\d{4}) followed by any remaining characters up to the end of the string (.*$), and replacing the match with ''. The stripped names can then be matched against the sorted IDs:
str1N <- sub('\\.\\d{4}.*$', '', str1)
str1[order(match(str1N, str2))]
#[1] "GTEX.N7MS.0007.SM.2D7W1" "GTEX.N7MS.0008.SM.4E3J1"
#[3] "GTEX.PFPP.0007.SM.2D8W1"
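
To apply the same idea to the columns of the data.frame itself, something along these lines should work. This is only a sketch: the toy PC and age.id below are made-up stand-ins for the real data, the non-sample column Name is hypothetical, and it assumes every sample column starts with "GTEX." and carries a suffix beginning with a dot and 4 digits.

```r
# Toy stand-ins for the real data (hypothetical values, not from the GTEx files)
PC <- data.frame(
  Name = c("gene1", "gene2"),
  GTEX.PFPP.0007.SM.2D8W1 = c(1, 2),
  GTEX.N7MS.0007.SM.2D7W1 = c(3, 4),
  GTEX.N7MS.0008.SM.4E3J1 = c(5, 6)
)
age.id <- c("GTEX.N7MS", "GTEX.PFPP")  # SUBJIDs assumed already sorted by AGE

# Separate the sample columns from everything else
is_sample <- startsWith(names(PC), "GTEX.")
sample_cols <- names(PC)[is_sample]

# Strip the sample suffix to recover the subject ID of each column
subj <- sub("\\.\\d{4}.*$", "", sample_cols)

# Order sample columns by the position of their subject in age.id;
# subjects missing from age.id give NA, which order() places last
ord <- order(match(subj, age.id))

# Keep the non-sample columns first, then the age-ordered sample columns
PC_sorted <- PC[, c(names(PC)[!is_sample], sample_cols[ord]), drop = FALSE]
names(PC_sorted)
```

Columns belonging to the same subject stay in their original relative order, and any sample whose subject was dropped by na.omit ends up at the right-hand side instead of being lost.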