
Stata Simple Randomization in One Line WITHOUT USING SPLITSAMPLE


I'd like to create a simple randomization of a dataset. The goal is to have 500 observations in treatment and 500 in control. This question is about Stata efficiency: I want to do it in one line.

I can do it in one line with imbalanced groups or three lines with perfect balance.

One line:

clear all
set obs 1000

//one line
gen treatment = mod(floor(runiform() * 1000),2)

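This is most likely imbalanced, because each observation is assigned independently and nothing forces the groups to be exactly 500 each.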

Three lines:

gen rand_n =  runiform()
sum (rand_n),d
gen treatment_again =  rand_n <= r(p50)

Clunky, terrible, and you can't even bysort in a single line like this!

I want to do this in one line, maybe two.

Why? Because Stata.


Solution

  • Since splitsample is precluded (it is slow), there are two options.

    First, you can repackage your clunky code into a program on the fly. I am not sure whether that counts as a solution in your mind, but it is a good strategy if you have to sample multiple times.

    Second, use the rndsub() function from egenmore, a community-contributed package of extra functions for egen (short for extended generate). egenmore is usually where one-line solutions to such problems can be found. You will need to install it first with ssc install egenmore.

    Here's an example of all three producing balanced groups of 500:

    . clear all
    
    . timer clear
    
    . set obs 1000
    Number of observations (_N) was 0, now 1,000.
    
    . 
    . 
    . timer on 1
    
    . splitsample, nsplit(2) gen(treatment)
    
    . timer off 1
    
    . 
    . 
    . timer on 2
    
    . capture program drop my_ss 
    
    . program define my_ss
      1.         capture drop treatment_again
      2.         tempvar r
      3.         gen `r' =  runiform()
      4.         _pctile `r', percentile(50)
      5.         gen treatment_again = `r' <= r(r1)
      6. end
    
    . 
    . my_ss
    
    . timer off 2
    
    . 
    . timer on 3
    
    . egen treatment_again2 = rndsub()
    
    . timer off 3
    
    . 
    . timer list
       1:      0.17 /        1 =       0.1700
       2:      0.00 /        1 =       0.0010
       3:      0.00 /        1 =       0.0020
    
    . 
    . tab1 treatment*
    
    -> tabulation of treatment  
    
      treatment |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        500       50.00       50.00
              2 |        500       50.00      100.00
    ------------+-----------------------------------
          Total |      1,000      100.00
    
    -> tabulation of treatment_again  
    
    treatment_a |
           gain |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        500       50.00       50.00
              1 |        500       50.00      100.00
    ------------+-----------------------------------
          Total |      1,000      100.00
    
    -> tabulation of treatment_again2  
    
    treatment_a |
          gain2 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        500       50.00       50.00
              2 |        500       50.00      100.00
    ------------+-----------------------------------
          Total |      1,000      100.00
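
    The first option can also be written as a reusable do-file program. The following is a minimal sketch, not the code timed above: balsplit is a hypothetical name, the seed is arbitrary, and it swaps the median cutoff for a unique rank of the random draws (egen's built-in rank() function with the unique option), so that exactly floor(_N/2) observations are treated even when _N is odd.

    ```stata
    * clear all also drops programs, so it must come before the definition
    clear all

    * balsplit is a hypothetical helper, not part of Stata or egenmore
    capture program drop balsplit
    program define balsplit
        * the single argument is the name of the variable to create
        syntax newvarname
        tempvar u rank
        * one uniform draw per observation
        generate double `u' = runiform()
        * unique ranks 1.._N of the draws; ties, if any, are broken arbitrarily
        egen long `rank' = rank(`u'), unique
        * the lower half of the ranks is treated (1), the rest control (0)
        generate byte `varlist' = `rank' <= _N/2
    end

    * arbitrary seed, used only to make the illustration reproducible
    set seed 12345
    set obs 1000

    balsplit treatment

    * check the balance: exactly 500 treated out of 1,000
    count if treatment == 1
    assert r(N) == 500
    ```

    Because the program takes the new variable's name as its argument, it can be called again for a fresh draw, for example balsplit treatment2, and it leaves the sort order of the data unchanged.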