Tags: r, matrix, vector, columns, sorting

Order df columns according to a target vector (but the names match only partially)


I have a data.frame (PC) that looks like this:

https://i.sstatic.net/NWJKe.png

which has 1000+ columns with similar names.

And I have a vector of those column names that looks like this:

https://i.sstatic.net/vQ48u.png

I want to sort the columns (beginning with "GTEX.") in the data.frame such that they are ordered by the age indicated in the age matrix.

PC <- read.csv("protein_coding.csv")
age <- read.table("Annotations_SubjectPhenotypes_DS.txt")

I started by changing the names in the age matrix to replace the '-' by '.':

new_SUBJID <- gsub("-", ".", age$SUBJID, fixed = TRUE)
age[, "SUBJID"] <- new_SUBJID

Then, I ordered the rows of the age matrix (one per SUBJID) by age:

sort.age <- with(age, age[order(AGE), ])
sort.age <- na.omit(sort.age)

I then created a vector age.id containing the SUBJIDs in the right order (= how I want to order the columns of the PC matrix).

age.id <- sort.age$SUBJID

But now I am stuck, since the column names in the PC matrix and the SUBJIDs in the age matrix match only partially... Could someone please help me?

Thank you very much in advance! Svalf


Solution

  • It would have been better to show the example as text rather than as an image. Suppose there are two character vectors,

    str1 <- c('GTEX.N7MS.0007.SM.2D7W1', 'GTEX.PFPP.0007.SM.2D8W1', 'GTEX.N7MS.0008.SM.4E3J1') 
    str2 <- c('GTEX.N7MS', 'GTEX.PFPP')
    

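    representing the column names of 'PC' and the 'SUBJID' column of the 'age' dataset (after replacing the - with . and sorting by age), we remove the suffix part by matching a literal . followed by 4 digits (\\.\\d{4}) followed by zero or more characters to the end of the string (.*$) and replace it by ''. Matching the shortened names against str2 then gives the position of each column in the target order, and order() turns those positions into a column ordering.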

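    # strip everything from the first ".dddd" onwards, leaving the subject ID
    str1N <- sub('\\.\\d{4}.*$', '', str1)

    # position of each subject ID in the target vector, then reorder by it
    str1[order(match(str1N, str2))]
    #[1] "GTEX.N7MS.0007.SM.2D7W1" "GTEX.N7MS.0008.SM.4E3J1"
    #[3] "GTEX.PFPP.0007.SM.2D8W1"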
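
The same idea can be applied to the data.frame itself. Below is a minimal sketch, assuming a toy 'PC' with one non-sample column (the name `Name` and all values are made up for illustration, as the real data were only shown as images) and an `age.id` vector that is already sorted by age. Columns that do not start with "GTEX." are kept in front, and sample columns whose subject is missing from `age.id` end up last, because `order()` places `NA` at the end by default.

```r
# Toy stand-ins for the real data (hypothetical values, not from the question)
PC <- data.frame(
  Name = c("gene1", "gene2"),
  GTEX.N7MS.0007.SM.2D7W1 = c(1, 2),
  GTEX.PFPP.0007.SM.2D8W1 = c(3, 4),
  GTEX.N7MS.0008.SM.4E3J1 = c(5, 6)
)
age.id <- c("GTEX.PFPP", "GTEX.N7MS")  # subject IDs, assumed already sorted by age

# Separate the sample columns from everything else
is_gtex   <- startsWith(names(PC), "GTEX.")
gtex_cols <- names(PC)[is_gtex]

# Reduce each sample column name to its subject ID
subj <- sub("\\.\\d{4}.*$", "", gtex_cols)

# Rank each column by the position of its subject in age.id
# (unmatched subjects give NA and are sorted to the end)
ord <- order(match(subj, age.id))

# Non-sample columns first, then the sample columns in age order
PC_sorted <- PC[, c(names(PC)[!is_gtex], gtex_cols[ord]), drop = FALSE]
names(PC_sorted)
# Name, GTEX.PFPP.0007.SM.2D8W1, GTEX.N7MS.0007.SM.2D7W1, GTEX.N7MS.0008.SM.4E3J1
```

One caveat: the pattern assumes that no subject ID itself begins with four digits, otherwise it would match too early and cut the ID off. If that could happen in your data, keeping the first two dot-separated fields with `sub("^([^.]+\\.[^.]+).*$", "\\1", gtex_cols)` may be a safer alternative.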