Search code examples
rrandomsamplemontecarloweighted

Weighted random sampling for Monte Carlo simulation in R


I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:

ID,    X2000,  X2001,  X2002,  X2003,  X2004
X11,   0,      0,      0.5,    0.5,    0
X33,   0.25,   0.25,   0.25,   0.25,   0
X55,   0,      0,      0,      0,      1
X77,   0.5,    0,      0,      0,      0.5

For weighting, "X11" should either return X2002 or X2003, "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, should be equal with no chance of returning X2004. The only possible return for "X55" should be X2004.

The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:

ID,    X2000,  X2001,  X2002,  X2003,  X2004
X11,   0,      0,      1,      0,      0
X33,   1,      0,      0,      0,      0
X55,   0,      0,      0,      0,      1
X77,   1,      0,      0,      0,      0

Solution

  • Your data.frame is transposed - the sample() function takes a probability vector. However, your probability vector is rowwise which means it's harder to extract from a data.frame.

    To get around this - you can import your ID column as a row.name. This allows you to be able to access it during an apply() statement. Note the apply() will coerce the data.frame to a matrix which means only one data type is allowed. That's why the IDs needed to be rownames - otherwise we'd have a probability vector of characters instead of numerics.

    mc_df <- read.table(
    text = 
    'ID    X2000  X2001  X2002  X2003  X2004
    X11   0      0      0.5    0.5    0
    X33   0.25   0.25   0.25   0.25   0
    X55   0      0      0      0      1
    X77   0.5    0      0      0      0.5'
                        , header = T
                        ,row.names = 1)
    

    From there, can use the apply function:

    apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
    

    Or you could make it fancy

    apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
    
    $X11
    
    X2002 X2003 
      102    98 
    
    $X33
    
    X2000 X2001 X2002 X2003 
       54    47    64    35 
    
    $X55
    
    X2004 
      200 
    
    $X77
    
    X2000 X2004 
      103    97 
    

    Fancier:

    apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
          X11 X33 X55 X77
    X2000   0  51   0  99
    X2001   0  50   0   0
    X2002  91  57   0   0
    X2003 109  42   0   0
    X2004   0   0 200 101
    

    Or fanciest:

    prop.table(apply(mc_df
                     , 1
                     , function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
               ,2)
           X11   X33 X55   X77
    X2000 0.00 0.270   0 0.515
    X2001 0.00 0.235   0 0.000
    X2002 0.51 0.320   0 0.000
    X2003 0.49 0.175   0 0.000
    X2004 0.00 0.000   1 0.485