I would like to run a Monte Carlo simulation. I have a data.frame
where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample
each row in the data.frame
based on the weights listed for each row. Each row should only return one value per run. The data.frame
structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
For weighting, "X11" should either return X2002 or X2003, "X33" should have an equal probability of returning X2000, X2001, X2002, or X2003, should be equal with no chance of returning X2004. The only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0
Your data.frame is transposed - the sample()
function takes a probability vector. However, your probability vector is rowwise which means it's harder to extract from a data.frame.
To get around this - you can import your ID column as a row.name. This allows you to be able to access it during an apply()
statement. Note the apply()
will coerce the data.frame to a matrix which means only one data type is allowed. That's why the IDs needed to be rownames - otherwise we'd have a probability vector of characters instead of numerics.
mc_df <- read.table(
text =
'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5'
, header = T
,row.names = 1)
From there, can use the apply function:
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
102 98
$X33
X2000 X2001 X2002 X2003
54 47 64 35
$X55
X2004
200
$X77
X2000 X2004
103 97
Fancier:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
X11 X33 X55 X77
X2000 0 51 0 99
X2001 0 50 0 0
X2002 91 57 0 0
X2003 109 42 0 0
X2004 0 0 200 101
Or fanciest:
prop.table(apply(mc_df
, 1
, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
,2)
X11 X33 X55 X77
X2000 0.00 0.270 0 0.515
X2001 0.00 0.235 0 0.000
X2002 0.51 0.320 0 0.000
X2003 0.49 0.175 0 0.000
X2004 0.00 0.000 1 0.485