
How to make the levels of a factor in a data frame consistent across all columns?


I have a data frame with 5 different columns:

         Test1   Test2   Test3  Test4  Test5 
Sample1  PASS    PASS    FAIL    WARN   WARN
Sample2  PASS    PASS    FAIL    PASS   WARN
Sample3  PASS    FAIL    FAIL    PASS   WARN
Sample4  PASS    FAIL    FAIL    PASS   WARN
Sample5  PASS    WARN    FAIL    WARN   WARN

In each column, each level is assigned a different integer code. In column 1, "PASS" is 1. In column 2, "PASS" is 2 and "FAIL" is 1. In column 3, "FAIL" is 1. In column 4, "PASS" is 1 and "WARN" is 2. In column 5, "WARN" is 1.

The codes are assigned in alphabetical order. I need "PASS" to be 1 in all columns, "WARN" to be 2 in all columns, and "FAIL" to be 3 in all columns, so that I can then convert the data frame into a matrix and turn it into a heatmap.

Currently the codes depend on which levels show up in a specific column, taken in alphabetical order.

How can I keep it constant throughout the entire data frame?


Solution

  • You can give every column of the dataset "df" the same levels, in the same order, by looping over the columns with lapply, converting each one to a factor again with the specified levels, and assigning the result back to the corresponding columns.

    lvls <- c('PASS', 'WARN', 'FAIL')
    df[] <-  lapply(df, factor, levels=lvls)
    str(df)
    # 'data.frame': 5 obs. of  5 variables:
    # $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
    # $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
    # $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
    # $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
    # $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2
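
Once the levels are consistent, the conversion to a matrix mentioned in the question can be sketched as follows. This is a minimal sketch, not part of the original answer: it rebuilds the example data from the table in the question, and the colour vector `cols` is a hypothetical choice (one colour per level).

```r
# Rebuild the example data (same values as the table in the question).
df <- data.frame(
  Test1 = c("PASS", "PASS", "PASS", "PASS", "PASS"),
  Test2 = c("PASS", "PASS", "FAIL", "FAIL", "WARN"),
  Test3 = c("FAIL", "FAIL", "FAIL", "FAIL", "FAIL"),
  Test4 = c("WARN", "PASS", "PASS", "PASS", "WARN"),
  Test5 = c("WARN", "WARN", "WARN", "WARN", "WARN"),
  row.names = paste0("Sample", 1:5),
  stringsAsFactors = TRUE
)

# Same re-levelling step as above.
lvls <- c("PASS", "WARN", "FAIL")
df[] <- lapply(df, factor, levels = lvls)

# data.matrix() replaces each factor by its integer codes and keeps
# the row and column names of the data frame.
m <- data.matrix(df)
m["Sample3", "Test2"]  # FAIL, so the code is 3

# Hypothetical colours, one per level, in the order of lvls.
cols <- c("green", "yellow", "red")

# Draw the heatmap only in an interactive session; no clustering, no scaling.
if (interactive()) {
  heatmap(m, Rowv = NA, Colv = NA, scale = "none", col = cols)
}
```

The one-colour-per-level mapping relies on all three codes (1, 2, 3) occurring somewhere in the matrix, as they do here; if a level were absent from the whole data set, the colour range would be stretched over the remaining codes.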
    

    If you opt to use data.table

    library(data.table)
    setDT(df)[, names(df):= lapply(.SD, factor, levels=lvls)]
    

    setDT converts the "data.frame" to a "data.table" by reference, and := assigns the re-leveled factor columns (lapply(..)) to the column names of the dataset. .SD stands for "Subset of Data"; with no grouping it holds all columns of the dataset.
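
One thing to be aware of with the data.table route: a data.table does not carry row names, so setDT(df) with its default arguments drops the sample labels. If they are needed later (for example as row labels of the matrix), a possible variant, sketched here on a cut-down three-row example and assuming the data.table package is installed, keeps them in a column and re-levels only the test columns. The name `test_cols` is introduced just for this illustration.

```r
library(data.table)

lvls <- c("PASS", "WARN", "FAIL")

# Cut-down example data with sample names as row names.
df <- data.frame(
  Test1 = c("PASS", "PASS", "PASS"),
  Test2 = c("PASS", "FAIL", "WARN"),
  row.names = c("Sample1", "Sample2", "Sample3"),
  stringsAsFactors = TRUE
)

# keep.rownames = TRUE stores the row names in a column called "rn".
dt <- as.data.table(df, keep.rownames = TRUE)

# Re-level only the test columns, leaving "rn" untouched.
test_cols <- setdiff(names(dt), "rn")
dt[, (test_cols) := lapply(.SD, factor, levels = lvls), .SDcols = test_cols]

# Integer codes now follow lvls: PASS = 1, WARN = 2, FAIL = 3.
as.integer(dt$Test2)
```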

    data

    df <- structure(list(Test1 = structure(c(1L, 1L, 1L, 1L, 1L), 
    .Label = "PASS", class = "factor"), 
      Test2 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("FAIL", 
     "PASS", "WARN"), class = "factor"), Test3 = structure(c(1L, 
     1L, 1L, 1L, 1L), .Label = "FAIL", class = "factor"), Test4 = 
     structure(c(2L, 1L, 1L, 1L, 2L), .Label = c("PASS", "WARN"), 
     class = "factor"), Test5 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = 
    "WARN", class = "factor")), .Names = c("Test1", 
    "Test2", "Test3", "Test4", "Test5"), row.names = c("Sample1", 
    "Sample2", "Sample3", "Sample4", "Sample5"), class = "data.frame")