Search code examples
rsortingcharacteralphabetical

How can I sort the columns of a large data frame based on specific criteria?


I would like to sort the columns of a large dataframe (around 14000 variables) based on a specific rationale.

The column names have the following structure (Condition_Sleepstage_Parameter_Electrode_Nightpart):

 [1] "Adapt_N2_negLengthLoc_C3_firstHour"          "Adapt_N3_negLengthLoc_C3_firstHour"         
  [3] "Adapt_NREM_negLengthLoc_C3_firstHour"        "Book_N2_negLengthLoc_C3_firstHour"          
  [5] "Book_N3_negLengthLoc_C3_firstHour"           "Book_NREM_negLengthLoc_C3_firstHour"

R has the columns ordered in a purely alphabetical structure but I would like to have them in a logical order based on the following system:

First, variables should be presented in blocks per parameter. (order: "negLengthLoc", "posLength", "wholeLength", "negPeak", "nbnegPeaks", "initialMeannegSlope", "finalMeannegSlope", "initialMaxnegslope", "finalMaxnegslope", "posPeak", "nbposPeaks", "initialMeannposSlope", "finalMeanposSlope", "initialMaxposSlope", "PeaktoPeak", "Number", "Density")

Within these blocks, the highest order hierarchy should be due to Condition. (order: "Adapt", "NoFilter", "Filter", "Book").

Thereafter, the next hierarchy should be defined by Electrode. (order: "F3", "Fz", "F4", "C3", "Cz", "C4", "P3", "Pz", "P4", "O1", "O2").

Thereafter by Nightpart (order:"firstHour", "firstQuarter", "secondQuarter", "thirdQuarter", "fourthQuarter", "wholeNight") and lastly by Sleepstage ("order: "N2", "N3", "NREM").

The resulting order should look like:

[1] "Adapt_N2_negLengthLoc_F3_firstHour"          "Adapt_N3_negLengthLoc_F3_firstHour"
[3] "Adapt_NREM_negLengthLoc_F3_firstHour"          "Adapt_N2_negLengthLoc_F3_firstQuarter"
[5] "Adapt_N3_negLengthLoc_F3_firstQuarter"          "Adapt_NREM_negLengthLoc_F3_firstQuarter"
[7] "Adapt_N2_negLengthLoc_F3_secondQuarter"          "Adapt_N3_negLengthLoc_F3_secondQuarter"
[9] "Adapt_NREM_negLengthLoc_F3_secondQuarter"          "Adapt_N2_negLengthLoc_F3_thirdQuarter"
[11] "Adapt_N3_negLengthLoc_F3_thirdQuarter"          "Adapt_NREM_negLengthLoc_F3_thirdQuarter"
[13] "Adapt_N2_negLengthLoc_F3_fourthQuarter"          "Adapt_N3_negLengthLoc_F3_fourthQuarter"
[15] "Adapt_NREM_negLengthLoc_F3_fourthQuarter"          "Adapt_N2_negLengthLoc_F3_wholeNight"
[17] "Adapt_N3_negLengthLoc_F3_wholeNight"           "Adapt_NREM_negLengthLoc_F3_wholeNight"
[19] "Adapt_N2_negLengthLoc_Fz_firstHour"           "Adapt_N3_negLengthLoc_Fz_firstHour"
...

I hope that someone can help me, if there are any further questions then I am of course happy to provide more information!

Thanks in advance!


Solution

  • Illustrating with the mtcars data, one can reorder the columns in a data frame by creating a vector with the the desired order, and using this in the column specification with the [ form of the extract operator.

    First, we'll use colnames() to extract the original order of the columns and print them

    theNames <- colnames(mtcars)
    theNames
    
    > theNames
     [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
    

    Next, we'll move all of the integer columns to the left side of the data frame by creating a reorderedNames vector and using it with [.

    reorderedNames <- c("cyl" , "vs" ,  "am" ,  "gear" ,"carb","disp" ,
                         "drat", "wt" ,  "qsec", "mpg")
    mtcars[,reorderedNames]
    

    ...and the first few rows of output:

    > mtcars[,reorderedNames]
                        cyl vs am gear carb  disp drat    wt  qsec  mpg
    Mazda RX4             6  0  1    4    4 160.0 3.90 2.620 16.46 21.0
    Mazda RX4 Wag         6  0  1    4    4 160.0 3.90 2.875 17.02 21.0
    Datsun 710            4  1  1    4    1 108.0 3.85 2.320 18.61 22.8
    Hornet 4 Drive        6  1  0    3    1 258.0 3.08 3.215 19.44 21.4
    Hornet Sportabout     8  0  0    3    2 360.0 3.15 3.440 17.02 18.7
    Valiant               6  1  0    3    1 225.0 2.76 3.460 20.22 18.1
    Duster 360            8  0  0    3    4 360.0 3.21 3.570 15.84 14.3
    Merc 240D             4  1  0    4    2 146.7 3.69 3.190 20.00 24.4
    

    Automating this process for a large data frame

    In the OP, the question references a data frame with a large number of columns. In order to extend this process to automate the sorting of columns, there are at least two major approaches.

    1. Process the column names from the data frame in a manner that allows one to sort the column names in the desired order, or
    2. Create a narrow format tidy data set by using pivot_longer() to split the column names into the required grouping variables.

    We'll use data from the OP to illustrate approach 1.

    columnName <- c("Adapt_N2_negLengthLoc_C3_firstHour","Adapt_N3_negLengthLoc_C3_firstHour",         
    "Adapt_NREM_negLengthLoc_C3_firstHour","Book_N2_negLengthLoc_C3_firstHour",          
    "Book_N3_negLengthLoc_C3_firstHour","Book_NREM_negLengthLoc_C3_firstHour")
    splitCols <- strsplit(columnName,"_")
    
    results <- lapply(splitCols,function(x){
         parameter <- x[3]
         condition <- x[1]
         electrode <- x[4]
         nightpart <- x[5]
         sleepstage <- x[2]
         data.frame(parameter,condition,electrode,nightpart,sleepstage)
    })
    
    colsData <- do.call(rbind,results)
    # add original column names back into data
    colsData <- cbind(columnName,colsData)
    # convert to factors, specifying the factor order for sorting
    conditionOrder <- c("Adapt", "NoFilter", "Filter", "Book")
    parameterOrder <- c("negLengthLoc", "posLength", "wholeLength", "negPeak", "nbnegPeaks", 
                        "initialMeannegSlope", "finalMeannegSlope", "initialMaxnegslope", 
                        "finalMaxnegslope", "posPeak", "nbposPeaks", "initialMeannposSlope",
                        "finalMeanposSlope", "initialMaxposSlope", "PeaktoPeak", "Number", "Density")
    electrodeOrder <- c("F3", "Fz", "F4", "C3", "Cz", "C4", "P3", "Pz", "P4", "O1", "O2")
    nightpartOrder <- c("firstHour", "firstQuarter", "secondQuarter", "thirdQuarter", "fourthQuarter", "wholeNight") 
    sleepstageOrder <- c("N2", "N3", "NREM")
    colsData$condition <- factor(colsData$condition,levels = conditionOrder,ordered = TRUE)
    colsData$parameter <- factor(colsData$parameter,levels = parameterOrder,ordered = TRUE)
    colsData$electrode <- factor(colsData$electrode,levels = electrodeOrder,ordered = TRUE)
    colsData$nightpart <- factor(colsData$nightpart,levels = nightpartOrder,ordered = TRUE)
    colsData$sleepstage <- factor(colsData$sleepstage,levels = sleepstageOrder,ordered = TRUE)
    
    # finally, sort by factors & create a vector for column number
    library(dplyr)
    colsData <- arrange(colsData,condition,parameter,electrode,nightpart,sleepstage)
    colsData$colId <- 1:nrow(colsData)
    colsData
    

    ...and the output:

    > colsData
                                columnName    parameter condition electrode nightpart
    1   Adapt_N2_negLengthLoc_C3_firstHour negLengthLoc     Adapt        C3 firstHour
    2   Adapt_N3_negLengthLoc_C3_firstHour negLengthLoc     Adapt        C3 firstHour
    3 Adapt_NREM_negLengthLoc_C3_firstHour negLengthLoc     Adapt        C3 firstHour
    4    Book_N2_negLengthLoc_C3_firstHour negLengthLoc      Book        C3 firstHour
    5    Book_N3_negLengthLoc_C3_firstHour negLengthLoc      Book        C3 firstHour
    6  Book_NREM_negLengthLoc_C3_firstHour negLengthLoc      Book        C3 firstHour
      sleepstage colId
    1         N2     1
    2         N3     2
    3       NREM     3
    4         N2     4
    5         N3     5
    6       NREM     6
    > 
    

    At this point we can use colsData$columnName to order the original data frame.