I'd like to create a simple randomization of a dataset. The goal is to have 500 in treatment and 500 in control. This question is about Stata efficiency: I want to do it in one line.
I can do it in one line with imbalanced groups or three lines with perfect balance.
One line:
clear all
set obs 1000
//one line
gen treatment = mod(floor(runiform() * 1000),2)
This is most likely imbalanced.
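To see why, note that each row is assigned independently with probability of about one half, so the number treated behaves like a Binomial(1000, 0.5) draw. A quick check using only built-in functions (the seed value is an arbitrary choice of mine, only there to make the run repeatable):

```stata
clear all
* Arbitrary seed, only so the draw can be repeated
set seed 12345
set obs 1000
gen treatment = mod(floor(runiform() * 1000), 2)

* Realized split for this particular draw
tabulate treatment

* Probability that 1000 independent coin flips give exactly 500 ones
display binomialp(1000, 500, 0.5)
```

The displayed probability is roughly 0.025, so an exact 500/500 split should turn up only about once in forty runs.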
Three lines:
gen rand_n = runiform()
sum (rand_n),d
gen treatment_again = rand_n <= r(p50)
Clunky, terrible, and you can't even bysort in a single line like this!
I want to do this in one line, maybe two.
Why? Because Stata.
Since splitsample is precluded (it is slow), there are two options.
First, you can repackage your clunky code into a program on the fly. I am not sure if that counts as a solution in your mind, but it is a good strategy if you have to sample multiple times.
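A minimal sketch of that first option, wrapping the median split in a program that takes the name of the new variable as its argument. The program name balanced_split and the seed are hypothetical choices for illustration, not anything built into Stata:

```stata
capture program drop balanced_split
program define balanced_split
    version 14
    * The caller supplies the name of the variable to create
    syntax newvarname
    tempvar u
    generate double `u' = runiform()
    * _pctile leaves the requested percentile (here the median) in r(r1)
    _pctile `u', percentiles(50)
    * Draws at or below the median are coded 1, the rest 0
    generate byte `varlist' = `u' <= r(r1)
end

* Usage: one line per assignment once the program is defined
clear
set obs 1000
set seed 12345
balanced_split treatment
tabulate treatment
```

With an even number of observations the median falls between the two middle draws, so, barring ties, the comparison puts exactly half of the rows in each group. With an odd number the groups differ by one.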
Second, use egenmore, a collection of additional functions for egen (short for extended generate). egenmore is usually where one-line solutions to such problems can be found. You will need to install it with ssc install egenmore, as it is a community-contributed package.
Here's an example of all three approaches (splitsample, the program, and egenmore's rndsub() function), each producing balanced groups of 500:
. clear all
. timer clear
. set obs 1000
Number of observations (_N) was 0, now 1,000.
.
.
. timer on 1
. splitsample, nsplit(2) gen(treatment)
. timer off 1
.
.
. timer on 2
. capture program drop my_ss
. program define my_ss
1. capture drop treatment_again
2. tempvar r
3. gen `r' = runiform()
4. _pctile `r', percentile(50)
5. gen treatment_again = `r' <= r(r1)
6. end
.
. my_ss
.
. timer off 2
.
. timer on 3
. egen treatment_again2 = rndsub()
. timer off 3
.
. timer list
1: 0.17 / 1 = 0.1700
2: 0.00 / 1 = 0.0010
3: 0.00 / 1 = 0.0020
.
. tab1 treatment*
-> tabulation of treatment

  treatment |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        500       50.00       50.00
          2 |        500       50.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00

-> tabulation of treatment_again

treatment_a |
       gain |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        500       50.00       50.00
          1 |        500       50.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00

-> tabulation of treatment_again2

treatment_a |
      gain2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        500       50.00       50.00
          2 |        500       50.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00