Search code examples
rtidyversecategorical-dataforcats

Dealing with QY data: How to define levels of QY factors


I have a dataset with Year, Quarter, QY columns and many numeric variables.

#example dataset
Year = c("2019", "2020", "2021",
         "2019", "2020", "2021",
         "2019", "2020", "2021",
         "2019", "2020", "2021")
Quarter = c("1Q", "1Q", "1Q",
            "2Q", "2Q", "2Q",
            "3Q", "3Q", "3Q",
            "4Q", "4Q", "4Q")
QY = c("1Q19", "1Q20", "1Q21",
                   "2Q19", "2Q20", "2Q21",
                   "3Q19", "3Q20", "3Q21",
                   "4Q19", "4Q20", "4Q21")
VAR1 = c(10, 20, 30,
         30, 20, 25,
         27, 10, 15,
         13, 34, 25)

df <- data.frame(Year, Quarter, QY, VAR1)

I need to turn all character columns to factor. Year and Quarter columns seem to rightfully have the correct level when converted but not QY column, so I defined it manually.

df$Year <- as.factor(df$Year)
df$Quarter <- as.factor(df$Quarter) 
df$QY <- as.factor(df$QY) 

#check the level, QY is incorrect
sapply(df, str)

#manually defined QY level
df$QY <- factor(df$QY,
                levels = c("1Q19", "2Q19", "3Q19", "4Q19",
                           "1Q20",  "2Q20", "3Q20", "4Q20",
                           "1Q21",  "2Q21", "3Q21", "4Q21"))

Is there a more efficient way that R can automatically identify the levels of these QY without me manually defining it? Especially as the QY grows, I need to define it from 2019.

I don't see the need to change any of the Year, Quarter, QY column to time variable but do clarify if I should do that instead. My calculation and data visualization needs are mostly on Year-to-Year changes or QY-to-QY changes.


Solution

  • For your example data you could arrange your data by Year and Quarter then use forcats::fct_inorder to set the levels for QY:

    library(dplyr, warn=FALSE)
    library(forcats)
    
    df <- df |> 
      arrange(Year, Quarter) |> 
      mutate(QY = fct_inorder(QY))
    
    levels(df$QY)
    #>  [1] "1Q19" "2Q19" "3Q19" "4Q19" "1Q20" "2Q20" "3Q20" "4Q20" "1Q21" "2Q21"
    #> [11] "3Q21" "4Q21"
    

    Or using base R you could achieve the same result like so:

    df <- df[order(df$Year, df$Quarter), ]
    df$QY <- factor(df$QY, levels = unique(df$QY))
    
    levels(df$QY)
    #>  [1] "1Q19" "2Q19" "3Q19" "4Q19" "1Q20" "2Q20" "3Q20" "4Q20" "1Q21" "2Q21"
    #> [11] "3Q21" "4Q21"
    

    And in case you don't want to avoid reordering the dataset you could do (thanks to @RuiBarradas for the suggestion):

    ix <- order(df$Year, df$Quarter)
    df$QY <- factor(df$QY, levels = unique(df$QY[ix]))
    
    levels(df$QY)
    #>  [1] "1Q19" "2Q19" "3Q19" "4Q19" "1Q20" "2Q20" "3Q20" "4Q20" "1Q21" "2Q21"
    #> [11] "3Q21" "4Q21"