This question isn't necessarily specific to the hts package, but it is motivated by the need to specify the hierarchy within column names for the hts package (function hts, argument "characters").
Original data:
library(data.table)
Original<-data.table(column_names=c("12_2985_40_4025", "12_2986_26_4027",
"12_3385_17_4863", "48_2570_433_3376"))
Original[,nchar:=nchar(column_names)]
Original
       column_names nchar
1:  12_2985_40_4025    15
2:  12_2986_26_4027    15
3:  12_3385_17_4863    15
4: 48_2570_433_3376    16
Notice that each row is composed of 4 pasted labels of a single time series built in a hierarchy. For example, Original$column_names[1], "12_2985_40_4025", is a time series of type "12", sub type "2985", sub sub type "40" and unique identifier "4025".
Illustration of the original data hierarchy:
The documentation of the "characters" argument says:
Integers indicate the segments in which the bottom level names can be read in order to construct the corresponding node structure and its labels. For instance, suppose one of the bottom series is named "VICMelb" referring to the city of Melbourne within the state of Victoria. Then characters would be specified as c(3, 4) referring to states of 3 characters (e.g., "VIC") and cities of 4 characters (e.g., "Melb"). All the bottom names must be of the same length, with number of characters for each segment the same for all series.
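To make the quoted rule concrete, here is a minimal sketch of how fixed-width segments cut up a bottom-level name. The helper split_by_widths is hypothetical (it is not part of hts) and only imitates the segmentation the documentation describes; it makes no claim about how hts implements it internally.

```r
# Hypothetical helper, not part of hts: cut a name into fixed-width segments.
split_by_widths <- function(name, widths) {
  ends <- cumsum(widths)        # position of the last character of each segment
  starts <- ends - widths + 1   # position of the first character of each segment
  substring(name, starts, ends)
}

split_by_widths("VICMelb", c(3, 4))
# "VIC" "Melb"

# With unpadded names, one set of widths cannot fit every series:
split_by_widths("12_2985_40_4025", c(3, 5, 4, 4))
# "12_" "2985_" "40_4" "025"
```

The second call shows why the names need padding: the third label is one character short, so every later segment is cut in the wrong place.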
So I need to convert the "Original" format into the "required" format so that I can feed it into an hts
object. Notice that I've added "l" (it can be any character) so that all sub sub level labels have the same length:
required<-data.table(names=c("12_2985_40l_4025", "12_2986_26l_4027",
"12_3385_17l_4863", "48_2570_433_3376"))
required[,nchar:=nchar(names)]
required
              names nchar
1: 12_2985_40l_4025    16
2: 12_2986_26l_4027    16
3: 12_3385_17l_4863    16
4: 48_2570_433_3376    16
Now the following hts code would work, since each of the "names" splits into 4 segments of lengths 3, 5, 4 and 4 (each segment except the last includes its trailing underscore):
library(hts)
abc <- ts(5 + matrix(sort(rnorm(400)), ncol = 4, nrow = 100)) # 100 x 4 = 400 values
colnames(abc) <- required$names
y <- hts(abc, characters=c(3,5,4,4)) # works once the names are padded to equal length
Alert_forecast <- forecast(y, h=10, method="comb")
plot(Alert_forecast, include=10)
General solution that I thought of (although I didn't manage to turn it into working code, and it is definitely not an elegant one): to convert the names to the proper format, first find the maximum label length at each of the 4 levels (over all values of "names"). Then loop over all "names", split each one into its levels, and if a label is shorter than the maximum for its level, paste on as many "l"s as needed so that it has the same length as every other series at that level.
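The steps described above can be sketched in base R roughly as follows. This is only one possible reading of the idea: the function name pad_levels and its arguments are hypothetical, and it assumes every name has the same number of "_"-separated levels.

```r
# Hypothetical helper: pad every "_"-separated level to the widest label at that level.
pad_levels <- function(x, sep = "_", pad = "l") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # Assumption: all names have the same number of levels.
  stopifnot(length(unique(lengths(parts))) == 1)
  m <- do.call(rbind, parts)   # one row per name, one column per level
  for (j in seq_len(ncol(m))) {
    missing_chars <- max(nchar(m[, j])) - nchar(m[, j])
    m[, j] <- paste0(m[, j], strrep(pad, missing_chars))  # append the padding
  }
  apply(m, 1, paste, collapse = sep)  # glue the levels back together
}

column_names <- c("12_2985_40_4025", "12_2986_26_4027",
                  "12_3385_17_4863", "48_2570_433_3376")
pad_levels(column_names)
# "12_2985_40l_4025" "12_2986_26l_4027" "12_3385_17l_4863" "48_2570_433_3376"
```

The loop runs over levels rather than over names, since the padding for a whole level can be computed in one vectorized step.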
Here's an attempt to solve this using the stringi package:
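library(data.table) # v1.9.6+
library(stringi)
Original[, tstrsplit(column_names, "_", fixed = TRUE) # split each name into its levels
         ][, lapply(.SD, function(x) stri_pad_right(x, max(nchar(x)), "l")) # pad each level to its widest label
           ][, do.call(paste, c(sep = "_", .SD))] # paste the levels back together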
## [1] "12_2985_40l_4025" "12_2986_26l_4027" "12_3385_17l_4863" "48_2570_433_3376"
The idea here is to split by _, find the maximum length per column, pad the shorter values with l, and combine everything back with the _ separator.
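The same split can also be used to work out the widths to pass as the "characters" argument, instead of counting them by hand. The following is a small base R sketch of that idea, not part of the solution above; the variable names are made up, and it assumes each segment except the last carries one trailing "_".

```r
column_names <- c("12_2985_40_4025", "12_2986_26_4027",
                  "12_3385_17_4863", "48_2570_433_3376")

# One row per name, one column per level.
parts <- do.call(rbind, strsplit(column_names, "_", fixed = TRUE))

# Widest label at each level, which is the width after padding.
level_widths <- apply(parts, 2, function(col) max(nchar(col)))

# Every segment except the last also includes its trailing "_".
characters <- level_widths + c(rep(1, length(level_widths) - 1), 0)
characters
# 3 5 4 4
```

The result matches the c(3,5,4,4) used in the hts call in the question.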