How do I store regression results from loops in Stata?


I have built a model which basically does the following:

run regressions on single time period
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store the quantile 1 and quantile 10 portfolio returns for the last period

The pair of variables is just the final entries in the timeframe. However, I intend to extend the single time period to a rolling window over a long timeframe, in essence:

for i in timeperiod {
    organise stocks into quantiles based on coefficient from linear regression
    statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
    store the quantile 1 and quantile 10 portfolio returns for the last period
}

The data I'm after are the portfolio 1 and portfolio 10 returns for the final day of each timeframe (each built using the previous 3 years of data). This should result in two time series of returns, which I can then regress against each other. My data cover 60 years in total, and the first 3 years are needed to build the first result, so each series spans 57 years.

regress portfolio 1 against portfolio 10

I am coming from an R background, where storing a variable in a vector is very simple, but I'm not quite sure how to go about this in Stata.

In the end I want a 2xn matrix (a separate dataset) of numbers, each pair being results of one run of a rolling regression. Sorry for the very vague description, but it's better than explaining what my model is about. Any pointers (even if it's to the right manual entry) will be much appreciated. Thank you.

EDIT: The actual data I want to store is just a variable. I made it confusing by adding regressions. I've changed the code to better represent what I want.


Solution

  • Unlike R, Stata operates with only one major rectangular object in memory, called (ta-da!) the data set. (It has a multitude of other stuff, of course, but that stuff can rarely be addressed as easily as the data set that was brought into memory with use). Since your ultimate goal is to run a regression, you will either need to create an additional data set, or awkwardly add the data to the existing data set. Given that your problem is sufficiently custom, you seem to need a custom solution.

    Solution 1: create a separate data set using post (see help).

    use my_data, clear
    postfile topost int(time_period) str40(portfolio) double(return_q1 return_q10) ///
         using my_derived_data, replace
    * 1. topost is a placeholder name
    * 2. I have no clue what you mean by "storing the portfolio", so you'd have to fill in
    * 3. This will create the file my_derived_data.dta, 
    *    which of course you can name as you wish
    * 4. The triple slash is a line-continuation comment: the command continues on the next line
    
    levelsof time_period, local( allyears )
    * 5. This will create a local macro allyears 
    *    that contains all the values of time_period
    
    foreach t of local allyears {
       regress outcome x1 x2 x3 if time_period == `t', robust
       * 6. the opening and closing single quotes are references to Stata local macros
       *    Here, I am referring to the cycle index t
    
       * placeholder: organise stocks into quantiles based on the coefficient from the linear regression
       * this isn't making huge sense to me, so you'll have to put your code here
       * don't forget to insert if time_period == `t' as needed
       * something like this:
       predict yhat`t' if time_period == `t', xb
       xtile decile`t' = yhat`t' if time_period == `t', n(10)
    
       * placeholder: calculate portfolio returns for stocks based on quantile
       forvalues q=1/10 {
            * do whatever if time_period == `t' & decile`t' == `q'
       }
    
       * store the quantile 1 and quantile 10 portfolio returns for the last period
       * again I am not sure what you mean and how to do that exactly
       * so I'll pretend it is something like
       ratio change / price if time_period == `t' , over( decile`t' )
       post topost (`t') ("whatever text describes the time `t' portfolio") /// 
           (_b[_ratio_1:1]) (_b[_ratio_1:10])
       * the last two sets of parentheses may contain whatever numeric answer you are producing
    }
    
    postclose topost
    * 7. close the file you are creating
    
    use my_derived_data, clear
    tsset time_period, year
    newey return_q10 return_q1, lag(3)
    * 8. just in case the business cycles have about 3 years of effect
    
    exit
    * 9. you always end your do-files with exit
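
    A minimal, self-contained sketch of the post mechanics above, assuming the auto data set that ships with Stata as a stand-in for the stock data. Everything specific here is illustrative rather than taken from the question: foreign plays the role of time_period, quartiles of mpg stand in for the deciles, and the names results, derived, mean_low and mean_high are made up.

    ```stata
    * Hypothetical stand-in data: the auto data set shipped with Stata
    sysuse auto, clear

    * temporary handle and temporary file, so nothing is left on disk
    tempname results
    tempfile derived
    postfile `results' int(group) double(mean_low mean_high) using `derived', replace

    * foreign (0/1) stands in for the time periods
    levelsof foreign, local(groups)
    foreach g of local groups {
        * stand-in for sorting stocks into quantiles
        xtile q`g' = mpg if foreign == `g', n(4)

        * summarize leaves the mean behind in r(mean)
        summarize price if foreign == `g' & q`g' == 1, meanonly
        local low = r(mean)
        summarize price if foreign == `g' & q`g' == 4, meanonly
        local high = r(mean)

        * one row of the new data set per pass through the loop
        post `results' (`g') (`low') (`high')
    }
    postclose `results'

    * the posted rows are now an ordinary data set
    use `derived', clear
    list
    ```

    Run the block in one go, since local macros do not survive between separately executed chunks of a do-file.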
    

    Solution 2: keep things within your current data set. If the above code looks awkward, you can instead create a weird centaur of a data set with both your original stocks and the summaries in it.

    use my_data, clear
    
    gen int collapsed_time_period = .
    gen double collapsed_return_q1 = .
    gen double collapsed_return_q10 = .
    * 1. set up placeholders for your results
    
    levelsof time_period, local( allyears )
    * 2. This will create a local macro allyears 
    *    that contains all the values of time_period
    
    local T : word count `allyears'
    * 3. I now use the local macro allyears as is
    *    and count how many distinct values there are of time_period variable
    
    forvalues n=1/`T' {
       * 4. my cycle now only runs for the numbers from 1 to `T'
    
       local t : word `n' of `allyears'
       * 5. I pull the `n'-th value of time_period
    
       ** computations as in the previous solution
    
       replace collapsed_time_period = `t' in `n'
       replace collapsed_return_q1 = (compute) in `n'
       replace collapsed_return_q10 = (compute) in `n'
       * 6. I am filling the pre-arranged variables with the relevant values
    }
    
    tsset collapsed_time_period, year
    * 7. this will likely complain about missing values, so you may have to fix it
    newey collapsed_return_q10 collapsed_return_q1, lag(3)
    * 8. just in case the business cycles have about 3 years of effect
    
    exit
    * 9. you always end your do-files with exit
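
    The same fill-in-place idea as a runnable sketch, again assuming the auto data set as a stand-in. Here rep78 plays the role of time_period, and collapsed_group and collapsed_mean are hypothetical names; the per-group mean of price stands in for whatever return you actually compute.

    ```stata
    * Hypothetical stand-in data: the auto data set shipped with Stata
    sysuse auto, clear

    * placeholders for the results, stored alongside the original rows
    gen int    collapsed_group = .
    gen double collapsed_mean  = .

    * distinct non-missing values of rep78, and how many there are
    levelsof rep78, local(levels)
    local K : word count `levels'

    forvalues n = 1/`K' {
        * pull the n-th distinct value
        local r : word `n' of `levels'

        * compute the summary and copy it out of r() before anything else runs
        summarize price if rep78 == `r', meanonly
        local m = r(mean)

        * write the n-th result into the n-th row
        replace collapsed_group = `r' in `n'
        replace collapsed_mean  = `m' in `n'
    }

    * the first K rows now hold the summaries; the rest stay missing
    list collapsed_group collapsed_mean in 1/`K'
    ```

    The summaries sit in rows that otherwise describe individual cars, which is exactly the centaur structure described above, so any later command should be restricted to the first `K' rows.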
    

    I avoided statsby as it overwrites the data set in memory. Remember that unlike R, Stata can only remember one data set at a time, so my preference is to avoid excessive I/O operations as they may well be the slowest part of the whole thing if you have a data set of 50+ Mbytes.
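
    If you would still rather use statsby, one common pattern is to wrap it in preserve and restore so that the original data come back afterwards, at the cost of the extra I/O mentioned above. A minimal sketch, assuming the auto data set as a stand-in and the hypothetical names bygroup and mean_price:

    ```stata
    * Hypothetical stand-in data: the auto data set shipped with Stata
    sysuse auto, clear

    tempfile bygroup
    preserve
        * statsby replaces the data in memory with one row per group
        statsby mean_price = r(mean), by(foreign) clear: summarize price
        save `bygroup'
    restore

    * the original 74 cars are back in memory
    describe, short

    * the collapsed results live in the temporary file
    use `bygroup', clear
    list
    ```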