Tags: r, mean, median, montecarlo

R: Trying to recreate mean-median difference gerrymander tests


I'm trying to recreate the mean-median difference test described here: Archive of NYT article. I've downloaded House data from MIT's Election Lab, and pared it down to the 2012 Pennsylvania race. Using dplyr, I whittled it down to the relevant columns, and it now looks something like this:

Rows: 42
Columns: 5
$ district       <dbl> 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 1~
$ party          <chr> "REPUBLICAN", "DEMOCRAT", "INDEPENDENT", "REPUBLICAN", "DEMOCRAT", "DEMOCRAT", ~
$ candidatevotes <dbl> 41708, 235394, 4829, 33381, 318176, 123933, 165826, 12755, 6210, 181603, 11524,~
$ totalvotes     <dbl> 277102, 277102, 356386, 356386, 356386, 302514, 302514, 302514, 303980, 303980,~
$ pct_votes      <dbl> 15.051497, 84.948503, 1.354991, 9.366530, 89.278479, 40.967691, 54.815975, 4.21~

Each row represents one candidate in one district. The final column, created with mutate, is the percentage of that district's vote that went to the candidate. Now, I can find the median and mean Democratic vote share with

PA2012_house_dem <- PA2012_house %>% filter(party == "DEMOCRAT") 
obs_median <- median(PA2012_house_dem$pct_votes)
obs_mean <- mean(PA2012_house_dem$pct_votes)
obs_median - obs_mean
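For reference, here is a minimal sketch of how a column like pct_votes and the observed statistic can be computed. The toy_house data frame and all of its numbers are hypothetical, made up for illustration; only the column names follow the glimpse above.

```r
library(dplyr)

# Hypothetical toy data: 3 districts, 2 candidates each (not real results).
toy_house <- data.frame(
  district       = c(1, 1, 2, 2, 3, 3),
  party          = rep(c("REPUBLICAN", "DEMOCRAT"), 3),
  candidatevotes = c(200, 800, 600, 400, 550, 450),
  totalvotes     = rep(1000, 6)
)

# Percentage of each district's vote won by each candidate,
# then keep only the Democratic rows.
toy_dem <- toy_house %>%
  mutate(pct_votes = 100 * candidatevotes / totalvotes) %>%
  filter(party == "DEMOCRAT")

# Median minus mean of the Democratic district shares (80, 40, 45).
toy_diff <- median(toy_dem$pct_votes) - mean(toy_dem$pct_votes)
toy_diff  # -10: the median district (45) sits below the mean (55)
```

In this toy case the Democratic vote is concentrated in one lopsided district, which pulls the mean above the median.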

What's giving me fits is calculating the "zone of chance". What I'd like to do is run a Monte Carlo simulation that randomly assigns each voter to a district, so that the number of voters in each district and the total number of votes for each party are unchanged, but the mix of Republican, Democratic, and other-party voters in each district is random, as in a permutation test. The mean Democratic vote should be unchanged, but I can't figure out a good way to carry out this randomization so that I can calculate the median district's Democratic vote percentage.

Thanks in advance for your help!

Edit for clarification: I'd like to do the randomization, say, 10,000 times, and for each of those trials, calculate the median-mean difference. The result should then, ideally, be a vector or data frame with 10,000 rows that I can turn into a histogram or something.
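One way to sketch the permutation described above without shuffling millions of individual voters: if district sizes and the statewide Democratic total are held fixed, the Democratic count in each district can be drawn sequentially from a hypergeometric distribution with base R's rhyper(). This is a sketch under that assumption, not necessarily the method used in the article. The vectors total_votes and dem_votes and the helper permute_dem_votes() are hypothetical names with made-up toy values.

```r
# Hypothetical toy data: total and Democratic votes in 5 districts.
total_votes <- c(1000, 1200, 900, 1100, 800)
dem_votes   <- c(800, 400, 350, 450, 300)

# Randomly reassign voters to districts, keeping every district's size
# and the overall Democratic total fixed. Returns Democratic votes per district.
permute_dem_votes <- function(total_votes, dem_total) {
  dem_left   <- dem_total
  other_left <- sum(total_votes) - dem_total
  out <- numeric(length(total_votes))
  for (i in seq_along(total_votes)) {
    # Democrats among the total_votes[i] voters drawn for district i
    out[i]     <- rhyper(1, m = dem_left, n = other_left, k = total_votes[i])
    dem_left   <- dem_left - out[i]
    other_left <- other_left - (total_votes[i] - out[i])
  }
  out
}

set.seed(1)
B <- 10000
perm_diff <- replicate(B, {
  pct <- 100 * permute_dem_votes(total_votes, sum(dem_votes)) / total_votes
  # median minus the vote-weighted mean (the weighted mean is constant here)
  median(pct) - weighted.mean(pct, w = total_votes)
})

quantile(perm_diff, c(0.025, 0.975))
```

The resulting perm_diff is a numeric vector of length B, so hist(perm_diff) gives the histogram directly. Each trial costs one rhyper() call per district rather than one draw per voter.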

EDIT 2 -- PARTIAL SOLUTION:

I have some code that runs, but it's not giving me a reasonable answer. Using dplyr, I've filtered out all but the DEMOCRAT votes, so that each row just gives me the Democrat vote share for a single district.

Rows: 18
Columns: 5
$ district       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
$ party          <chr> "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCRAT", "DEMOCR~
$ candidatevotes <dbl> 235394, 318176, 123933, 104643, 104725, 143803, 143509, 152859, 105128, 94227, 118231, 163589, 209901, ~
$ totalvotes     <dbl> 277102, 356386, 302514, 303980, 282465, 335528, 353451, 352238, 274305, 273790, 285198, 338941, 303819,~
$ pct_votes      <dbl> 84.94850, 89.27848, 40.96769, 34.42430, 37.07539, 42.85872, 40.60223, 43.39651, 38.32522, 34.41579, 41.~

This is saved as PA2012_reduced_dem.

Now, here is my code:

require(mosaic) # for the tally() function
data <- PA2012_reduced_dem
B <- 100
samples_diff <- vector("numeric", B)
samples_mean <- vector("numeric", B)
samples_median <- vector("numeric", B)

for(samp in 1:B) {
  data_w_sample <- mutate(data, sample_vote = tally(sample(district, sum(candidatevotes), replace = TRUE, prob = totalvotes)))
  data_w_sample <- mutate(data_w_sample, sample_vote_pct = (sample_vote / totalvotes *100))
  mean_sample <- weighted.mean(data_w_sample$sample_vote_pct, w = data_w_sample$totalvotes)
  median_sample <- median(data_w_sample$sample_vote_pct)
  diff_mean_median <- mean_sample - median_sample
  samples_diff[samp] <- diff_mean_median
  samples_mean[samp] <- mean_sample
  samples_median[samp] <- median_sample
}

samples <- data.frame(samples_mean,samples_median,samples_diff)

The idea is that I'm randomly placing each Democrat voter in a district, weighted by the total number of votes per district. Since I have the total vote as a variable, I can compute the share of vote in each district that goes to the Democrat (I'm ignoring independent and other party votes).

Obviously, this is slow, because each trial is sampling for every single Democrat vote (roughly 2.8 million), so I'm only running 100 trials right now.

However, my Monte Carlo simulations are finding a very small "zone of chance" around the mean: the median is only about 0.05 percentage points above or below the mean. Again, I'm only running 100 trials, but I was expecting a wider zone of chance.
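A rough back-of-the-envelope check suggests that a spread this small is what random assignment of individual voters tends to produce. The sketch below assumes each district's Democratic count behaves approximately like a binomial draw. The values of n and p are illustrative assumptions (n is roughly the size of the totalvotes values shown above, p is a made-up statewide share), not figures taken from the data.

```r
n <- 300000  # assumed votes per district, roughly the scale of totalvotes
p <- 0.5     # assumed statewide Democratic share, for illustration only

# Approximate standard error of one district's Democratic share,
# in percentage points, under random assignment of voters.
se_pct <- 100 * sqrt(p * (1 - p) / n)
se_pct  # about 0.09 percentage points
```

With hundreds of thousands of voters per district, every simulated district lands very close to the statewide share, so the median can barely move away from the mean.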


Solution

  • I figured it out! Randomly placing voters in each district is not correct, and honestly it was pretty silly of me to do so. Instead, I used dplyr to create a data frame with the number of Democratic and Republican votes in each of the 435 House districts, one district per row. Then I followed the advice on page 12 of this paper: I drew samples of 18 districts from this 435-row data frame and rejected any sample whose mean vote share was more than 1 percentage point away from that of PA. The results have a much nicer 95% confidence interval, which matches the results of the original article.

    library(dplyr) # for sample_n(), near(), filter_all()

    data <- house_2012_reduced
    # created with dplyr, contains total and percentage of votes
    # for Democrats and Republicans.
    # mean_dem_pct_PA (PA's Democratic vote share, in percent) must
    # already be defined; its definition is not shown here.
    B <- 100000
    del_districts <- 18 # 18 districts in PA
    samples_diff <- vector("numeric", B)
    samples_mean <- vector("numeric", B)
    samples_median <- vector("numeric", B)
    
    for(samp in 1:B) {
      # draw a random 18-district "delegation" from all House districts
      sample_delegation <- sample_n(data, del_districts)
      sample_delegation_pct_dem_mean <- weighted.mean(sample_delegation$pct_dem_votes, w = sample_delegation$total_votes)
      sample_delegation_pct_dem_median <- median(sample_delegation$pct_dem_votes)
      # keep the sample only if its mean is within 1 point of PA's
      if(near(mean_dem_pct_PA, sample_delegation_pct_dem_mean, 1)){
        samples_mean[samp] <- sample_delegation_pct_dem_mean
        samples_median[samp] <- sample_delegation_pct_dem_median
        samples_diff[samp] <- (sample_delegation_pct_dem_mean - sample_delegation_pct_dem_median)
      }
    }
    
    samples <- data.frame(samples_mean,samples_median,samples_diff)
    # rejected trials were left as all zeros; drop them
    samples <- filter_all(samples, any_vars(. != 0))
    quantile(samples$samples_median, c(0.025,0.975))
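For anyone who wants to try the rejection-sampling idea without the real data, here is a self-contained base R sketch. The national data frame is synthetic, randomly generated and not real election results. The names state_mean_pct and sim_diff and the value 50 are hypothetical stand-ins. It marks rejected trials with NA instead of zeros, which is a small variation on the code above rather than a reproduction of it.

```r
set.seed(42)

# Synthetic stand-in for the 435-district table (NOT real election data).
national <- data.frame(
  total_votes   = round(runif(435, 250000, 350000)),
  pct_dem_votes = 100 * rbeta(435, 5, 5)
)

state_mean_pct <- 50    # hypothetical statewide Democratic share, in percent
n_districts    <- 18    # size of the delegation being simulated
B              <- 5000  # number of trials

sim_diff <- replicate(B, {
  s <- national[sample(nrow(national), n_districts), ]
  m <- weighted.mean(s$pct_dem_votes, w = s$total_votes)
  # Keep the trial only if its mean is within 1 point of the state's;
  # otherwise mark it as rejected with NA.
  if (abs(m - state_mean_pct) < 1) m - median(s$pct_dem_votes) else NA_real_
})

sim_diff <- sim_diff[!is.na(sim_diff)]  # drop rejected trials
quantile(sim_diff, c(0.025, 0.975))     # interval for mean minus median
```

Calling hist(sim_diff) afterwards gives the histogram asked about in the question. Using NA for rejected trials avoids accidentally discarding an accepted trial whose values happen to be zero.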