Tags: r, time-series, forecasting

Create proper input names for the hts "characters" argument in R


This question isn't necessarily specific to the hts package, but its motivation comes from the need to specify the hierarchy within column names in the hts package (function hts, argument "characters").

original data:

library(data.table)
Original<-data.table(column_names=c("12_2985_40_4025", "12_2986_26_4027", 
          "12_3385_17_4863", "48_2570_433_3376"))
Original[,nchar:=nchar(column_names)]
Original

Original

       column_names nchar
1:  12_2985_40_4025    15
2:  12_2986_26_4027    15
3:  12_3385_17_4863    15
4: 48_2570_433_3376    16

Notice that each row is composed of 4 pasted labels of a single time series built into a hierarchy. For example, Original$column_names[1], "12_2985_40_4025", is a time series of type "12", sub type "2985", sub sub type "40" and unique identifier "4025".

Illustration of the original data hierarchy:

[image]

The "characters" argument is documented as follows:

Integers indicate the segments in which the bottom level names can be read in order to construct the corresponding node structure and its labels. For instance, suppose one of the bottom series is named "VICMelb" referring to the city of Melbourne within the state of Victoria. Then characters would be specified as c(3, 4) referring to states of 3 characters (e.g., "VIC") and cities of 4 characters (e.g., "Melb"). All the bottom names must be of the same length, with number of characters for each segment the same for all series.
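To make the rule concrete, here is a small base R sketch of how a name is cut by a vector of widths. split_by_widths is a hypothetical helper written only for illustration; it is not part of hts and only mimics the documented rule, not the package's internals.

```r
# Hypothetical helper (not from hts): cut a name into consecutive
# segments whose lengths are given by `widths`.
split_by_widths <- function(x, widths) {
  ends <- cumsum(widths)
  starts <- ends - widths + 1
  substring(x, starts, ends)
}

# The documentation's example: 3 characters of state, 4 of city.
vic <- split_by_widths("VICMelb", c(3, 4))

# A 15-character name from the question, cut with widths that fit it.
ok <- split_by_widths("12_2985_40_4025", c(3, 5, 3, 4))

# The same name cut with the widths that "48_2570_433_3376" needs:
# the last two segments come out misaligned.
bad <- split_by_widths("12_2985_40_4025", c(3, 5, 4, 4))

print(vic)
print(ok)
print(bad)
```

Because one widths vector has to fit every bottom-level name, a name whose third segment is shorter than the others gets cut in the wrong places.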

So I need to convert the "Original" format into the "required" format, so that I can feed it into an hts object. Notice that I've added "l" (it can be any character) so that every sub sub level label has the same length:

required<-data.table(names=c("12_2985_40l_4025", "12_2986_26l_4027", 
                             "12_3385_17l_4863", "48_2570_433_3376"))
required[,nchar:=nchar(names)]
required

required

              names nchar
1: 12_2985_40l_4025    16
2: 12_2986_26l_4027    16
3: 12_3385_17l_4863    16
4: 48_2570_433_3376    16

Now the following hts code works, because each name is split into 4 segments of lengths 3, 5, 4 and 4 (each length includes the trailing underscore, except for the last segment):

library(hts)
abc <- ts(5 + matrix(sort(rnorm(1000)), ncol = 4, nrow = 100))
colnames(abc) <- required$names
y <- hts(abc, characters=c(3,5,4,4)) # works once the names are padded to equal length
Alert_forecast <- forecast(y, h=10, method="comb")
plot(Alert_forecast, include=10)
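The widths passed to "characters" do not have to be counted by hand. The following base R sketch derives them from the padded names, assuming "_" is the separator and that padding has already made the segment widths identical across names. The variable names (padded, chars) are illustrative only.

```r
# Padded bottom-level names, copied from the "required" table.
padded <- c("12_2985_40l_4025", "12_2986_26l_4027",
            "12_3385_17l_4863", "48_2570_433_3376")

parts <- strsplit(padded, "_", fixed = TRUE)

# Segment widths per name; after padding they must be identical for all names.
seg_widths <- unique(lapply(parts, nchar))
stopifnot(length(seg_widths) == 1L)
w <- seg_widths[[1]]

# Every segment except the last also carries its trailing "_".
chars <- w + c(rep(1L, length(w) - 1L), 0L)
print(chars)

# Sanity check: the widths must add up to the full name length.
stopifnot(all(nchar(padded) == sum(chars)))
```

The resulting vector could then be supplied as characters=chars instead of the hard-coded c(3,5,4,4).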

General solution that I thought of (although I didn't manage to formulate it properly in code, and it is definitely not an elegant one): to convert the names to the proper format, first find the maximum length of each of the 4 levels (over all values of "names"), then loop over all "names", split each one into its levels, and if a level is shorter than the maximum for that level, paste on as many "l" characters as needed so that it has the same length as every other time series at that level.


Solution

  • Here's an attempt to solve this using the stringi package

    library(data.table) #V 1.9.6+
    library(stringi)
    Original[, tstrsplit(column_names, "_", fixed = TRUE)
             ][, lapply(.SD, function(x) stri_pad_right(x, max(nchar(x)), "l"))
               ][, do.call(paste, c(sep = "_", .SD))]
    
    ## [1] "12_2985_40l_4025" "12_2986_26l_4027" "12_3385_17l_4863" "48_2570_433_3376"
    

    The idea here is to split by "_", find the maximum length per column, pad the shorter values with "l" on the right, and combine everything back with the "_" separator.
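If stringi and data.table are not available, the same four steps can be sketched in base R. pad_segments is a hypothetical helper, not a function from any package, and it assumes every name has the same number of "_"-separated segments and that the pad character differs from the separator.

```r
# Hypothetical helper (not from any package): pad each "_"-separated
# segment with `pad` up to the longest width seen at that level.
pad_segments <- function(x, sep = "_", pad = "l") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # Assumption: every name has the same number of segments.
  stopifnot(length(unique(lengths(parts))) == 1L)
  m <- do.call(rbind, parts)            # rows = names, columns = levels
  for (j in seq_len(ncol(m))) {
    gap <- max(nchar(m[, j])) - nchar(m[, j])
    m[, j] <- paste0(m[, j], strrep(pad, gap))
  }
  apply(m, 1, paste, collapse = sep)
}

original <- c("12_2985_40_4025", "12_2986_26_4027",
              "12_3385_17_4863", "48_2570_433_3376")
padded <- pad_segments(original)
print(padded)
```

strrep requires R 3.3.0 or later. The loop runs over levels rather than over names, so each level's maximum width is computed once.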