Search code examples
rloopsvectormedianseq

R Estimating the population median value by combining sample medians


I need to calculate the population median for several time periods by combining the medians of 10 different samples in each period (dataset Median). Each of the sample median has been obtained by taking a different number of observations (dataset Observation).

Median - dataset

       Time1    Time2   Time3   Time4   Time5
Sample1 60000   71139   70000   75000   75000
Sample2 80000   88000   87750   88500   90000
Sample3 66000   73325   73000   78126   75000
Sample4 60000   74000   72000   75500   73000
Sample5 50500   60000   60000   66750   81500
Sample6 60000   70000   72000   78500   80000
Sample7 50000   60000   59999   63000   60000
Sample8 53000   55000   58300   59995   64500
Sample9 92529   111000  115000  120063  118000
Sample10 92500  115000  101000  104100  110075 

Observations - dataset

Time1   Time2   Time3   Time4   Time5
Sample1 159 202 174 134 172
Sample2 148 178 148 121 140
Sample3 563 680 652 513 678
Sample4 554 634 518 512 595
Sample5 343 415 347 270 390
Sample6 738 954 769 720 825
Sample7 704 949 863 648 762
Sample8 595 681 640 517 663
Sample9 517 782 610 504 472
Sample10    627 733 621 493 512

I am trying to generate a vector with the Median[1:1] repeated Observations[1:1] times, this vector need to be concatenate to another vector Median[1:2] repeated Observations[1:2] times, then concatenate the vector to another vector Median[1:3] repeated Observations[1:3] times , and so on...

I aim to generate 5 vectors (as many as columns - periods) each of these vectors with a length equal to the aggregate number of sample observations in each time frame.

 for (i in 1:ncol(Median))  {

   for (j in 1:nrow(Median)) { 

  vector_median=(seq(as.numeric(Med[i,j]),as.numeric(Med   [i,j]),length.out=as.numeric(Observations[i,j])))

 }
 }

Solution

  • Consider a nested mapply (the multiple-input version of apply family) where you pass both Med and Observations columns in pairwise iteration and then pass each of the columns corresponding Sample values in a pairwise iteration into the rep() function:

    Data

    txt = "       Time1    Time2   Time3   Time4   Time5
    Sample1 60000   71139   70000   75000   75000
    Sample2 80000   88000   87750   88500   90000
    Sample3 66000   73325   73000   78126   75000
    Sample4 60000   74000   72000   75500   73000
    Sample5 50500   60000   60000   66750   81500
    Sample6 60000   70000   72000   78500   80000
    Sample7 50000   60000   59999   63000   60000
    Sample8 53000   55000   58300   59995   64500
    Sample9 92529   111000  115000  120063  118000
    Sample10 92500  115000  101000  104100  110075 "
    
    Med = read.table(text=txt, header=TRUE)
    
    txt = "Time1   Time2   Time3   Time4   Time5
    Sample1 159 202 174 134 172
    Sample2 148 178 148 121 140
    Sample3 563 680 652 513 678
    Sample4 554 634 518 512 595
    Sample5 343 415 347 270 390
    Sample6 738 954 769 720 825
    Sample7 704 949 863 648 762
    Sample8 595 681 640 517 663
    Sample9 517 782 610 504 472
    Sample10    627 733 621 493 512"
    
    Obs = read.table(text=txt, header=TRUE)
    

    Process

    replicate_medians <- function(m,o){      
      mapply(function(m_sub, o_sub) rep(m_sub, times=o_sub), m, o)      
    }
    
    output <- mapply(function(x,y) unlist(replicate_medians(x,y)), Med, Obs, SIMPLIFY=FALSE)    
    
    # EQUIVALENT WITH Map() WRAPPER
    output <- Map(function(x,y) unlist(replicate_medians(x,y)), Med, Obs)
    

    Output (returns a list of 5 named numeric vectors)

    str(output)
    # List of 5
    #  $ Time1: int [1:4948] 60000 60000 60000 60000 60000 60000 60000 60000 60000 60000 ...
    #  $ Time2: int [1:6208] 71139 71139 71139 71139 71139 71139 71139 71139 71139 71139 ...
    #  $ Time3: int [1:5342] 70000 70000 70000 70000 70000 70000 70000 70000 70000 70000 ...
    #  $ Time4: int [1:4432] 75000 75000 75000 75000 75000 75000 75000 75000 75000 75000 ...
    #  $ Time5: int [1:5209] 75000 75000 75000 75000 75000 75000 75000 75000 75000 75000 ...
    
    length(output$Time1[output$Time1==60000])
    #[1] 1451   <---- THREE SAMPLES WITH THIS MEDIAN: 159 + 554 + 738 = 1,451
    
    length(output$Time1[output$Time1==80000])
    # [1] 148
    
    length(output$Time1[output$Time1==66000])
    # [1] 563