Creating single vectors in R


I am working with multivariate time series data. To test for stationarity (with adf.test and similar functions), I need a separate single-vector time series for each variable and for its differences, rather than the original data.frame. I came across the code below:

dif1<- c()
for(i in 1:length(city)){

  dif1[i]= paste(city[i],"diff1",sep=".") 

 } 

for(x in 1:22){

 assign(dif1[x], diff(eval(parse(text = city[x]))[,1]))
}

The outcome looks like this:

> dif1
[1] "AZ.Phoenix.diff1"       "CA.Los.Angeles.diff1"   "CA.San.Diego.diff1"  "CA.San.Francisco.diff1" "CO.Denver.diff1" 
[6] "DC.Washington.diff1"    "FL.Miami.diff1"         "FL.Tampa.diff1"       "GA.Atlanta.diff1"       "IL.Chicago.diff1"      
[11] "MA.Boston.diff1"        "MI.Detroit.diff1"       "MN.Minneapolis.diff1"  "NC.Charlotte.diff1"     "NV.Las.Vegas.diff1"    
[16] "NY.New.York.diff1"      "OH.Cleveland.diff1"     "OR.Portland.diff1"      "TX.Dallas.diff1"        "WA.Seattle.diff1"      
[21] "Composite.20.diff1"     "National.US.diff1"  

I am very confused about how the code above works. For example, why is the index [,1] required in the assign line? I tried deleting it, and every resulting vector turned out empty. Can anyone help me understand how the code works? Thanks.

Edit: The answer explained well what the code does. This appears to be a common way for experienced users to handle vector-wise data in time series work. I have seen textbooks and publications do exactly the same thing, but the method had confused me for a while. It is easy to use, though:

1/ Read each time series from the working environment: eval(parse(text = city[x])), where x indexes the objects saved in the working environment;

2/ Assign the function's result to a new vector: assign(dif1[x], diff(eval(parse(text = city[x]))[,1])).

This method does not return a data.frame or matrix. Instead, it produces the results one vector at a time and saves them in the working environment. It is a different strategy from using lapply() on a data.frame. One can think of it as breaking the data.frame or matrix apart by feature column and passing each feature from memory to the function. This way, one can work through multivariate time series data and report each feature's analytical result. Time series analysis rarely works on a data.frame directly, which may leave many analysts (including me) struggling at first. I provided two sources below:

1/ What does typical time series data look like? Check out the graph and data here: https://fred.stlouisfed.org/series/CORESTICKM159SFRBATL;

2/ The forecasting time series analysis open course is here: https://ocw.mit.edu/courses/14-384-time-series-analysis-fall-2013/


Solution

  • What's happening in your code

    dif1 <- c()
    for(i in 1:length(city)){
      dif1[i]= paste(city[i],"diff1",sep=".") 
    } 
    

    Here, you are iterating through each value in the city vector (a character vector with values "AZ.Phoenix" all the way to "National.US") and tacking ".diff1" onto the end of each value. The results are stored in dif1, which is now a character vector containing 22 elements: "AZ.Phoenix.diff1" ... "National.US.diff1".
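
Because paste() is vectorized, the same result can be had without a loop. A minimal sketch, using a shortened three-element stand-in for your city vector (the values are only for illustration):

```r
# Shortened stand-in for the real 22-element `city` vector
city <- c("AZ.Phoenix", "CA.Los.Angeles", "National.US")

# paste() works element-wise, so no loop or pre-allocation is needed
dif1 <- paste(city, "diff1", sep = ".")

print(dif1)
```

The result has one name per city, in the same order as the input vector.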

    Then we have the crazy part:

    for(x in 1:22){
     assign(paste(dif1[x]), diff((eval(parse(text = city[x])))[,1]))
    }
    

    Let's go from the inside out.

    parse(text = city[x])
    

    There is a difference in R between the string:

    "print(5 + 5)"
    

    and the actual code:

    print(5 + 5)
    

    Using parse, you can convert the string such that it is actually treated as an unevaluated expression. So

    parse(text = "print(5 + 5)")
    

    Would be converted into:

    expression(print(5 + 5))
    

    And to actually have that expression run, you'd need to evaluate it by using eval:

    eval(parse(text = "print(5+5)"))
    

    Only then do you actually see 10 get printed out. In your case

    eval(parse(text = city[x]))
    

    When x has a value of 1 will basically be treated by R as the code:

    AZ.Phoenix
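
A small self-contained sketch of this lookup-by-name step, with a made-up data frame standing in for your real AZ.Phoenix object (its contents are an assumption). It also shows get(), a base R function that fetches an object by its name and is usually easier to read than eval(parse()):

```r
# Hypothetical stand-in for one of the objects in your environment
AZ.Phoenix <- data.frame(index = c(100, 102, 105, 103))
city <- c("AZ.Phoenix")

expr <- parse(text = city[1])   # an unevaluated expression holding the name AZ.Phoenix
via_eval <- eval(expr)          # evaluating it returns the object itself
via_get <- get(city[1])         # get() looks the object up by its name directly

identical(via_eval, via_get)
```

Both routes hand back the same data frame, so either can feed the rest of the pipeline.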
    

    I think you may have an extra set of brackets in there that don't really do anything, which gets us to:

    (eval(parse(text = city[x])))
    

    So let's just look at the code again as if we weren't doing any for loop and just working with x = 1 such that city[x] is "AZ.Phoenix" and dif1[x] is "AZ.Phoenix.diff1".

    assign(paste("AZ.Phoenix.diff1"), diff(AZ.Phoenix[,1]))
    

    I don't think the paste call around the single string is doing anything, so I'm pretty sure it is fine to remove:

    assign("AZ.Phoenix.diff1", diff(AZ.Phoenix[,1]))
    

    I may be misunderstanding, but this suggests that a data.frame called AZ.Phoenix already exists in your environment. The [,1] is an indexing option which you can read as "all the rows, just the first column". Generally, for any data.frame or matrix in R, you can pick out specific rows and columns by calling my_dataframe[the rows to keep, the columns to keep].
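
To make the [rows, columns] pattern concrete, here is a short sketch on a made-up two-column data frame (the column names and values are assumptions, not your real data):

```r
# Hypothetical data frame standing in for AZ.Phoenix
AZ.Phoenix <- data.frame(
    index = c(2, 4, 6, 8, 5),
    month = c("Jan", "Feb", "Mar", "Apr", "May")
)

AZ.Phoenix[, 1]       # all rows, first column: returned as a plain numeric vector
AZ.Phoenix[1:2, ]     # first two rows, all columns: still a data frame
AZ.Phoenix[3, 1]      # a single cell: row 3, column 1
```

Leaving a position empty means "keep everything" along that dimension.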

    Now we're at diff(), which is a function that can convert a series like 2 4 6 8 5 into 2 2 2 -3. More on that here: https://www.r-bloggers.com/2023/06/mastering-the-power-of-rs-diff-function-a-programmers-guide/. The result is your time series transformed into the series of differences between consecutive values.
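
The sketch below runs diff() on that example series and then on a one-column data frame, which is an assumption about what your AZ.Phoenix looks like. If that assumption holds, it probably explains why deleting [,1] left you with empty results: diff() measures a data frame by its number of columns, not its number of rows.

```r
x <- c(2, 4, 6, 8, 5)
diff(x)    # 2 2 2 -3: one element shorter than x

# Hypothetical one-column data frame, assumed to resemble your AZ.Phoenix
AZ.Phoenix <- data.frame(index = x)

with_index <- diff(AZ.Phoenix[, 1])   # differences of the numeric column
no_index   <- diff(AZ.Phoenix)        # diff() applied to the data frame itself

length(with_index)   # 4
length(no_index)     # 0: a data frame's length is its column count (here 1),
                     # which is too short for diff() to difference anything
```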

    Finally, we arrive at assign(). The two pieces of code below will do the same thing:

    # Without assign:
    my_number <- 5
    
    # With assign:
    assign("my_number", 5)
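
Where assign() differs from <- is that the variable name can be built at run time, which is what your loop relies on. A brief sketch with made-up names, next to a list-based alternative that stores the value under the same name without creating a new variable:

```r
var_name <- paste("my", "number", sep = "_")   # the name is built as a string

assign(var_name, 5)   # creates a variable called my_number
my_number

# List-based alternative: same name, but kept inside one container
values <- list()
values[[var_name]] <- 5
values$my_number
```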
    

    I think with that you should have all the pieces you need to understand everything going on in the code you've posted.

    Why writing R code like this is not recommended

    Your code is using the power of parse, eval, and assign to create new variables in a loop. One limitation of this is that the code (like what you've posted) ends up being quite difficult to follow. But perhaps a much bigger limitation is that you cannot continue to work with the variables you've created in an automated way without diving back into more parse and eval lines of code.

    For example, what if you wanted to multiply every vector you've created by 1000 to change units? You'd need to manually do:

    AZ.Phoenix.diff1 <- 1000 * AZ.Phoenix.diff1
    CA.Los.Angeles.diff1 <- 1000 * CA.Los.Angeles.diff1
    

    And so on, 22 times. It would be very rough and would need manual re-adjustment if you ever added more cities.

    What you should do in the future

    Learn about lapply(). The revised code using it may look something along the lines of:

    # A list of all your timeseries, defined once at the start
    city_timelines <- list(
        AZ.Phoenix,
        CA.Los.Angeles
        # , followed by the rest of the cities
    )
    
    city_diffs <- lapply(
        city_timelines,
        function(x) { 
            diff(x[, 1])  # first column only, as in your original loop
        }
    )
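
Since you already have the object names in city, the list does not have to be typed out by hand. A sketch using base R's mget(), which fetches several objects by name and returns them as a named list; the two small data frames are made-up stand-ins for your real objects:

```r
# Hypothetical stand-ins for the objects in your environment
AZ.Phoenix     <- data.frame(index = c(100, 102, 105, 103))
CA.Los.Angeles <- data.frame(index = c(200, 199, 204, 210))
city <- c("AZ.Phoenix", "CA.Los.Angeles")

# mget() returns a list whose element names are the object names
city_timelines <- mget(city)

# Take the first column of each data frame before differencing
city_diffs <- lapply(city_timelines, function(x) diff(x[, 1]))

names(city_diffs)
city_diffs$AZ.Phoenix
```

Because the list is named from the start, each series can be pulled out by city name right away.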
    

    And if you wanted to do more with the code afterwards:

    # Change units (multiply every series by 1000):
    city_diffs_new_units <- lapply(
        city_diffs,
        function(x) { 
            1000 * x
        }
    )
    
    # Access series by name
    names(city_diffs_new_units) <- city
    
    city_diffs_new_units$"AZ.Phoenix" # Will spit out the AZ.Phoenix series
    
    # Or by number
    city_diffs_new_units[[1]] # Will also spit out the AZ.Phoenix series
    

    lapply() and the other apply() functions may seem difficult to learn at first, but they follow a clear logic that is shared across many programming languages. Other R programmers will also find them much easier to work with and understand than the eval/parse/assign stuff.
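
The same pattern carries over to the stationarity tests you mentioned. The sketch below runs one test per series and collects the p-values in a single named vector. The data are simulated random walks, not your real series, and PP.test() (the Phillips-Perron test in base R's stats package) is used only as a stand-in so the example runs without extra packages; adf.test from the tseries package could be swapped in.

```r
set.seed(1)

# Simulated random walks standing in for the real city series (made-up data)
city_levels <- list(
    AZ.Phoenix     = cumsum(rnorm(200)),
    CA.Los.Angeles = cumsum(rnorm(200))
)
city_diffs <- lapply(city_levels, diff)

# One unit-root test per differenced series; sapply() simplifies the
# results into a named numeric vector
p_values <- sapply(
    city_diffs,
    function(x) PP.test(x)$p.value
)

p_values   # one p-value per city, named after the list elements
```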