Tags: r, dplyr, levels, data-mapping

Mapping non-numeric factor to choose higher value between two columns in R


I have a data frame with two columns, PathGroupStage and ClinGroupStage. I want to create a new column, OutputStage, that holds the higher of the two stages.

Valid stage values: I, IA, IB, II, IIA, IIB, III, IIIA, IIIB, IIIC, IV, IVA, IVB, IVC, Unknown (coded as "Unknown/Not Reported" in the data).

  • If both stages have values, use the higher one, e.g., IIIB > IIIA > III
  • If one is missing and the other has a value, use the one with a value
  • If both are missing or unknown, then OutputStage = Unknown

How would I derive the OutputStage variable by comparing the non-numeric values in the two columns? I think I need factor levels, but how would I compare factors across different columns?
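To make the comparison concrete, here is a minimal base R sketch of how two ordered factors that share the same level set can be compared elementwise. The vectors `path` and `clin` are small illustrative stand-ins, not the full dataset, and ranking "Unknown/Not Reported" as the lowest level is an assumption made so that any real stage outranks it.

```r
# Stage labels in ascending order; placing "Unknown/Not Reported" first
# (lowest) is an assumption so that any known stage compares as higher.
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB",
            "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

# Two small illustrative vectors (hypothetical, not the full dataset)
path <- factor(c("IIIA", "I"), levels = stages, ordered = TRUE)
clin <- factor(c("IIB", "III"), levels = stages, ordered = TRUE)

# Comparison operators on ordered factors follow the level order,
# not alphabetical order. Both factors must share the same levels.
path_is_higher <- path > clin
print(path_is_higher)
```

Because both columns are built from the same `levels` vector, operators such as `>` and functions such as `pmax()` can work across them row by row.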

Here is the sample dataset:

   PathGroupStage       ClinGroupStage
1              II                 <NA>
2               I                   IA
3             IVB                  IVB
4            IIIA Unknown/Not Reported
5               I                  III
6              II                 <NA>
7            IIIA                  IIB
8              II                   II
9            <NA>                 <NA>
10           IIIB Unknown/Not Reported

    df <- structure(list(PathGroupStage = c("II", "I", "IVB", "IIIA", "I", 
        "II", "IIIA", "II", NA, "IIIB"), ClinGroupStage = c(NA, "IA", 
        "IVB", "Unknown/Not Reported", "III", NA, "IIB", "II", NA, "Unknown/Not Reported"
        )), row.names = c(NA, 10L), class = "data.frame")

Solution

  • One option could be:

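    library(dplyr)
    
    # Stage labels from lowest to highest; "Unknown/Not Reported" ranks lowest
    stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB",
                "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")
    
    df %>%
        # Convert both columns to ordered factors, then take the row-wise maximum
        mutate(across(everything(), ~ factor(.x, levels = stages, ordered = TRUE)),
               OutputStage = pmax(PathGroupStage, ClinGroupStage, na.rm = TRUE))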
    
       PathGroupStage       ClinGroupStage OutputStage
    1              II                 <NA>          II
    2               I                   IA          IA
    3             IVB                  IVB         IVB
    4            IIIA Unknown/Not Reported        IIIA
    5               I                  III         III
    6              II                 <NA>          II
    7            IIIA                  IIB        IIIA
    8              II                   II          II
    9            <NA>                 <NA>        <NA>
    10           IIIB Unknown/Not Reported        IIIB
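Note that row 9, where both inputs are missing, comes out as `<NA>`, while the question asks for an unknown stage in that case. One possible extension, sketched here under the assumption that "Unknown/Not Reported" is the label wanted for that case, is to fill the remaining `NA` values after taking the maximum. The name `out` is introduced only for illustration.

```r
library(dplyr)

# Stage labels from lowest to highest, as in the answer above
stages <- c("Unknown/Not Reported", "I", "IA", "IB", "II", "IIA", "IIB",
            "III", "IIIA", "IIIB", "IIIC", "IV", "IVA", "IVB", "IVC")

# Sample data from the question
df <- data.frame(
  PathGroupStage = c("II", "I", "IVB", "IIIA", "I", "II", "IIIA", "II", NA, "IIIB"),
  ClinGroupStage = c(NA, "IA", "IVB", "Unknown/Not Reported", "III", NA, "IIB",
                     "II", NA, "Unknown/Not Reported")
)

out <- df %>%
  mutate(
    # Ordered factors sharing one level set, so they can be compared
    across(everything(), ~ factor(.x, levels = stages, ordered = TRUE)),
    # Row-wise maximum; with na.rm = TRUE a single NA is ignored
    OutputStage = pmax(PathGroupStage, ClinGroupStage, na.rm = TRUE),
    # Rows where both inputs were NA are still NA: label them as unknown
    # (assumption: "Unknown/Not Reported" is the desired label here)
    OutputStage = replace(OutputStage, is.na(OutputStage), "Unknown/Not Reported")
  )

print(out)
```

The fill value has to be one of the existing factor levels, which is why the sketch reuses the "Unknown/Not Reported" label rather than a new string.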