
R median large vector


I'm having trouble finding a faster way to calculate the median and mean of a large vector in R. How would I implement this? I'm using the code below, but it's too slow. I'm thinking about parallel processing, but I have no idea how to make that work. Thanks.

    vector <- 1:10000000000
    m <- mean(vector)
    md <- median(vector)

Solution

  • Assuming we're dealing with a sequential integer vector 1:n, this may help:

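    ## Given (note that 10e8 is 1e9)
    V <- 1:10e8
    n <- length(V)
    
    ## Median by position (valid only because V is sorted)
    ## if/else replaces ifelse(), which is meant for vectors; the result is
    ## stored in `med` so that base R's median() is not shadowed
    med <- if (n %% 2 == 0) mean(V[(n/2):((n/2) + 1)]) else V[(n + 1)/2]
    med
    # OUTPUT: 5e+08   (exactly 500000000.5; R prints 7 significant digits by default)
    
    ## Mean from the closed-form sum of 1..n
    sum_series <- n * (n + 1) / 2    # sum of the first n integers
    mu <- sum_series / n             # `mu` avoids shadowing base R's mean()
    mu
    # OUTPUT: 5e+08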
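
For the sequential case the vector never has to be built at all, because both the mean and the median of 1:n equal (n + 1) / 2. A minimal sketch, assuming hypothetical helper names `seq_mean` and `seq_median` (they are not part of base R):

```r
# Closed-form mean and median of the sequence 1:n.
# No vector is allocated, so this runs in constant time and memory.
# seq_mean and seq_median are illustrative names, not base R functions.
seq_mean <- function(n) (n + 1) / 2
seq_median <- function(n) (n + 1) / 2   # 1:n is symmetric, so median == mean

n <- 1e10
seq_mean(n)
seq_median(n)

# Sanity check against base R on small cases (even and odd n)
stopifnot(seq_mean(10) == mean(1:10), seq_median(10) == median(1:10))
stopifnot(seq_mean(11) == mean(1:11), seq_median(11) == median(1:11))
```

This only applies when the vector really is 1:n; for arbitrary data the values have to be read at least once to get an exact mean.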
    

    For large random vectors, the median can be found by position in the same way, as long as the vector is sorted. If the mean has no closed formula, you can estimate it:

    ### Estimation via Repeated Sampling ###
    est_mean <- function(V, k, size) {
      # k: number of sample means used in the estimate
      # size: number of elements drawn (with replacement) for each sample mean
      est <- numeric(k)
      samp <- matrix(NA, nrow = size, ncol = k)
    
      # Draw k samples, one per column, then take the mean of each column
      for (j in seq_len(k)) samp[, j] <- sample(V, size, replace = TRUE)
      for (j in seq_len(k)) est[j] <- mean(samp[, j])
      est <- sort(est)
    
      # Return the middle value of the sorted sample means
      return(est[ceiling(length(est) / 2)])
    }
    
    ### Time Complexity of Estimation ###
    # Allocate samp + est = k*size + k
    #     size, k ~ 30 --> enough for the sample means to be approximately normal
    # Number of iterations * (create sample vector + store) = k*(size + size)
    #     --> 2*k*size
    # Total = k + 3*k*size --> constant with respect to length(V)
    
    ### Time Complexity of Base R mean() ###
    # Assuming it is computed as sum(V) / length(V):
    # sum N items + find length + 1 division + 1 return = N + 3 --> linear in N
    
    
    ### Example ###
    set.seed(0)
    V <- sort(sample(0:10e8, 10e7, replace = TRUE))
    
    start1 <- Sys.time()
    est_mu <- est_mean(V, 1000, 30)
    end1 <- Sys.time()
    diff1 <- end1 - start1
    
    start2 <- Sys.time()
    r_mu <- mean (V)
    end2 <- Sys.time()
    diff2 <- end2 - start2
    
    diff1
    # OUTPUT: Time difference of 0.08370018 secs
    diff2
    # OUTPUT: Time difference of 0.5321879 secs
    
    # This is a relative difference (a fraction, about 0.68%), not a percentage
    print(paste("Relative difference = ", abs(r_mu - est_mu)/r_mu))
    # OUTPUT: "Relative difference =  0.00678363793285072"
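
How far a sampling estimate is likely to be from the true mean can also be quantified. As a rough sketch, using a swapped-in technique rather than the est_mean function above: draw one random sample, take its mean, and report the standard error sd / sqrt(m). The vector below is a smaller stand-in for the real data, and the names `m`, `s`, `se`, `ci` and `rel_err` are illustrative.

```r
set.seed(0)
V <- runif(1e6, min = 0, max = 1e9)   # smaller stand-in for the large vector

m <- 30000                            # same number of draws as k = 1000, size = 30
s <- sample(V, m, replace = TRUE)     # one random sample, with replacement

est <- mean(s)                        # point estimate of mean(V)
se <- sd(s) / sqrt(m)                 # standard error of that estimate
ci <- est + c(-1, 1) * 1.96 * se      # approximate 95% interval (normal approximation)

# Relative error against the exact mean, for comparison only
rel_err <- abs(est - mean(V)) / mean(V)
rel_err
```

The cost depends on `m`, not on length(V), and the standard error shrinks like 1 / sqrt(m), so quadrupling the sample size roughly halves the expected error.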