Translating Stata to R: collapse


Just came across a .do file that I need to translate into R because I don't have a Stata license; my Stata is rusty, so can someone confirm that the code is doing what I think it is?

For reproducibility, I'm going to apply it to a data set I found online, specifically the Milk Production dataset (p004) that's part of a textbook by Chatterjee, Hadi and Price.

Here's the Stata code:

collapse (min) min_protein = protein /// 
         (mean) avg_protein = protein /// 
         (median) median_protein = protein /// 
         (sd) sd_protein = protein /// 
         if protein > 2.8, by(lactatio)

Here's what I think it's doing in data.table syntax:

library(data.table)
library(foreign)
DT = read.dta("p004.dta")
setDT(DT)

DT[protein > 2.8,
   .(min_protein = min(protein),
     avg_protein = mean(protein),
     median_protein = median(protein),
     sd_protein = sd(protein)),
   keyby = lactatio]

#    lactatio min_protein avg_protein median_protein sd_protein
# 1:        1         2.9    3.162632           3.10  0.2180803
# 2:        2         2.9    3.304688           3.25  0.2858736
# 3:        3         2.9    3.371429           3.35  0.4547672
# 4:        4         2.9    3.231250           3.20  0.3419917
# 5:        5         2.9    3.855556           3.20  1.9086061
# 6:        6         3.0    3.200000           3.10  0.2645751
# 7:        7         3.3    3.650000           3.65  0.4949748
# 8:        8         3.2    3.300000           3.30  0.1414214

Is that correct?

This would be easy to confirm if I had used Stata in the past 18 months or if I had a copy installed--hoping I can bend the ear of someone for whom either of these is true. Thanks.


Solution

  • Your intuition is correct. collapse is the Stata equivalent of R's aggregate function: it replaces the input dataset with a new one by applying one or more aggregating functions to the variables you name, within each group given in by(). The if qualifier restricts which observations go into the statistics, so filtering in the i argument of DT[...] is the right translation.

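    Here's the output for that Stata command on the example dataset; it agrees with your data.table result, apart from a difference in the last displayed digit of sd_protein for lactatio 6: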

    . list
    
         +------------------------------------------------------+
         | lactatio   min_pr~n   avg_pr~n   median~n   sd_pro~n |
         |------------------------------------------------------|
      1. |        1        2.9   3.162632        3.1   .2180803 |
      2. |        2        2.9   3.304688       3.25   .2858736 |
      3. |        3        2.9   3.371429       3.35   .4547672 |
      4. |        4        2.9    3.23125        3.2   .3419917 |
      5. |        5        2.9   3.855556        3.2   1.908606 |
         |------------------------------------------------------|
      6. |        6          3        3.2        3.1   .2645752 |
      7. |        7        3.3       3.65       3.65   .4949748 |
      8. |        8        3.2        3.3        3.3   .1414214 |
         +------------------------------------------------------+
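Since the comparison above is to aggregate, here is a minimal base R sketch of the same collapse pattern for readers who don't use data.table. It is only one way to do it, not a canonical translation. The data frame below is toy data made up for illustration (it is not the p004 dataset), and the names df, kept, stats, pieces and result are hypothetical.

```r
# Toy data, made up for illustration; NOT the p004 milk production data.
df <- data.frame(
  lactatio = c(1, 1, 1, 2, 2, 2),
  protein  = c(2.7, 3.0, 3.4, 2.9, 3.1, 3.6)
)

# Stata's `if protein > 2.8` becomes a row filter applied before aggregating.
kept <- df[df$protein > 2.8, ]

# One aggregate() call per statistic, since each call applies a single FUN.
stats <- list(min = min, avg = mean, median = median, sd = sd)
pieces <- lapply(names(stats), function(nm) {
  out <- aggregate(protein ~ lactatio, data = kept, FUN = stats[[nm]])
  names(out)[2] <- paste0(nm, "_protein")  # e.g. min_protein, avg_protein
  out
})

# Join the per-statistic tables on the grouping variable, like by(lactatio).
result <- Reduce(function(a, b) merge(a, b, by = "lactatio"), pieces)
print(result)
```

The data.table version in the question does all of this in a single call, which is why it reads closer to the original collapse command.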