Stata: estimating monthly weighted mean for portfolio

I have been struggling to write efficient code to estimate monthly value-weighted mean returns for portfolios.

I have the following variables:

firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)

I want to calculate returns for each month and portfolio, weighted by the market cap (mcap) of each firm.

I have written the following code, which runs without error but takes ages and is highly inefficient:

    foreach x in 11 12 13 21 22 23 {
        display `x'

        forvalues y = 1980/2010 {
            display `y'

            forvalues m = 1/12 {
                display `m'
                tempvar tmp_wt tmp_tm tmp_p
                egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
                gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
                gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
                gen port_ret_`m'_`y'_`x' = `tmp_p'
            }
        }
    }

The data look as shown in the image: ![Data for value weighted portfolio return][1]

Solution

This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.

It seems to boil down to

    gen double wanted = .
    bysort port1 year1 month1 : replace wanted = sum(mcap)
    by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
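The three lines above reproduce what your loop produces: each firm's weighted contribution, mcap * ret divided by the group total of mcap. If what you ultimately want is a single value-weighted return per portfolio and month, those contributions still have to be added up within each group. Here is a minimal sketch of that, on a made-up toy dataset; the names sum_w, sum_wr and vw_ret are mine, not yours, and I am assuming ret and mcap are never missing within a group.

```stata
* Toy data, invented purely for illustration
clear
input port1 year1 month1 ret mcap
11 1980 1  0.10 100
11 1980 1  0.20 300
11 1980 2 -0.05 200
12 1980 1  0.04 500
end

* Running totals of the weights and of weight * return within each group
bysort port1 year1 month1 : gen double sum_w  = sum(mcap)
by port1 year1 month1     : gen double sum_wr = sum(mcap * ret)

* The last running total in each group is the group total.
* Assumption: no missing ret; otherwise that firm's mcap would still
* count in the denominator and bias the result.
by port1 year1 month1     : gen double vw_ret = sum_wr[_N] / sum_w[_N]

drop sum_w sum_wr
list
```

Every observation in a group then carries the same vw_ret, so you can keep one observation per group if a portfolio-month dataset is what you are after.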

Principle. To get a single total as a scalar, use summarize, meanonly (the sum is left in r(sum)) rather than egen, total(), which repeats that one number in every observation of a new variable. When group sums in a variable are what you need, as here, use sum() with by:. sum() returns cumulative sums, so the group total is the last value within each group, hence wanted[_N].
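To make the distinction concrete, here is a small sketch of both routes on invented data. The scalar name total11 and the variable names cum_mcap and tot_mcap are just for illustration.

```stata
* Toy data, invented purely for illustration
clear
input port1 mcap
11 100
11 300
12 500
end

* One total held as a scalar: summarize, meanonly leaves the sum in r(sum)
summarize mcap if port1 == 11, meanonly
scalar total11 = r(sum)
display scalar(total11)

* Group totals held in a variable: cumulative sum within each group,
* then copy the last (complete) value to every observation of the group
bysort port1 : gen double cum_mcap = sum(mcap)
by port1     : gen double tot_mcap = cum_mcap[_N]
list
```

The first route costs no extra column at all; the second costs columns, but only because a value per observation is genuinely wanted.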

Principle. Loops (here using foreach and forvalues) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.

Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
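You can watch the pile-up happen on a tiny scale. This sketch, on invented data, counts the variables in memory before and after a loop that calls tempvar on each pass; k_before and k_after are names I made up, and c(k) is Stata's count of variables in the dataset. It is meant to be run as one do-file, since the temporary variables disappear when the do-file or program ends.

```stata
* Toy data, invented purely for illustration
clear
set obs 5
gen x = _n

scalar k_before = c(k)      // number of variables before the loop

forvalues i = 1/3 {
    tempvar t               // a fresh temporary name on every pass
    gen `t' = x * `i'       // so generate adds a new column every pass
}

scalar k_after = c(k)       // number of variables after the loop
display "variables added by the loop: " scalar(k_after) - scalar(k_before)
```

Three passes leave three extra columns behind, and your loop makes 2232 passes with three temporary variables each.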

Principle. Using if obliges Stata to check every observation in turn. Here almost all of them are irrelevant to the particular combination of loop values being processed: with 6 * 31 * 12 = 2232 combinations of roughly equal size, each pass does its calculation for about 1/2232 of the dataset and tests the other 2231/2232 for nothing. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.

In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.

Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.

Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.
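A short sketch of the difference, with made-up names (answer, also_answer, x): the scalar and the local macro each hold one number, while generate fills a whole column with it.

```stata
* Toy data, invented purely for illustration
clear
set obs 1000

* A single constant, held outside the dataset
scalar answer = 42
local also_answer = 42
display scalar(answer) + `also_answer'

* A variable: 1000 copies of the same number, one per observation
gen x = 42
summarize x
```

Neither the scalar nor the local adds anything to the dataset; the variable adds a column as long as the dataset itself.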