Assigning random values in an existing R data frame, with probabilities that depend on another column


I'm using Walker's alias method to adjust data that has been randomly rounded to base 3. I have already assigned an alias value to each rounded value of 3 in the data frame and stored it in the column AliasColumn. The values in AliasColumn are integers in the range 1 through 5. I've used the Alias Method from here. The data frame looks like this (it has 64 rows; 20 are shown):

 Industry     AliasColumn
 1            5
 2            5
 3            4
 4            2
 5            3
 6            1
 7            2
 8            2
 9            3
10            5
11            4
12            4
13            4
14            2
15            2
16            1
17            4
18            3
19            5
20            5

Based on the AliasColumn value, I need to toss a loaded coin to create the "real" business count (NumBusinesses), which is between 1 and 5. The loaded coin table is:

 AliasColumn      1      2      3     4       5
 "Heads prob"    8/12   11/12   1    10/12    5/12
 "Alias prob"    4/12    1/12   -     2/12    7/12
 Alias value      2      3      -     3       1

For example, if the AliasColumn value is 1, then 8/12 of the time the NumBusinesses value will be 1 and 4/12 of the time the NumBusinesses value will be 2. For the AliasColumn value of 3, that is the only value that can be assigned to NumBusinesses.

Thus, NumBusinesses receives one of two values, with probabilities conditional on the value in AliasColumn. Because NumBusinesses can only take one of two integer values, and because those values differ depending on AliasColumn, I was hoping to use the sample() function in R. I have been unable to get this to work.

I have tried the following. (I've just noticed that my code for AliasColumn value 4 is written differently from that for 1 and 2, but the output didn't seem any different from when I initially ran it with 1:2 and 2:3 instead of 1,2 and 2,3, respectively.)

foo$NumBusinesses[AliasCol==1] <-sample(c(1,2),1, replace=TRUE,prob=c(8,4))
foo$NumBusinesses[AliasCol==2] <-sample(c(2,3),1, replace=TRUE,prob=c(11,1))
foo$NumBusinesses[AliasCol==3] <- 3
foo$NumBusinesses[AliasCol==4] <-sample(c(3:4),1, replace=TRUE,prob=c(2,10))
foo$NumBusinesses[AliasCol==5] <-sample(c(1,5),1, replace=TRUE,prob=c(7,5))

This seems to set the NumBusinesses value to be the same as that in AliasColumn, apart from when the AliasColumn value is 5, in which case NumBusinesses is set to 1.

I considered an ifelse loop, and attempted one:

ifelse(foo$AliasCol==1, foo$NumBusinesses<- Sample(c(1,2),1, replace=TRUE,prob=c(8,4)),
                                       ifelse(foo$AliasCol==2),
                                       foo$NumBusinesses<- sample(c(2,3),1, replace=TRUE,prob=c(11,1)),
                                       ifelse(foo$AliasCol==3), foo$NumBusinesses<- 3,
                                       ifelse(foo$AliasCol==4), 
                                       foo$NumBusinesses <- sample(c(3:4),1, replace=TRUE,prob=c(2,10)),
                                       foo$NumBusinesses <- sample(c(1,5),1, replace=TRUE,prob=c(7,5)))

And I received this error (which makes me believe I am overthinking the loop):

 Error in ifelse(foo$AliasCol == 1, foo$NumBusinesses <- sample(c(1,  :   unused arguments (foo3$NumBusinesses <- sample(c(2, 3), 1, replace = TRUE, prob = c(11, 1)), ifelse(foo$AliasCol == 3), foo$NumBusinesses <- 3, ifelse(foo$AliasCol == 4), foo$NumBusinesses <- sample(c(3:4), 1, replace = TRUE, prob = c(2, 10)), foo$NumBusinesses <- sample(c(1, 5), 1, replace = TRUE, prob = c(7, 5)))

How can I generate my conditional output in one step, or one set of steps?


Solution

  • Say you have this:

    # probabilities of not changing AliasColumn ("heads")
    headProb <- c(8/12, 11/12, 1, 10/12, 5/12)
    # alias values used when AliasColumn changes ("tails")
    aliasValues <- c(2, 3, NA, 3, 1)
    # your data.frame
    df <- structure(list(Industry = 1:20, AliasColumn = c(5L, 5L, 4L, 2L,
    3L, 1L, 2L, 2L, 3L, 5L, 4L, 4L, 4L, 2L, 2L, 1L, 4L, 3L, 5L, 5L
    )), .Names = c("Industry", "AliasColumn"), class = "data.frame", row.names = c(NA, -20L))
    

    Then you can try:

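    # one uniform draw per row: keep AliasColumn with probability headProb,
    # otherwise switch to the alias value
    df$NumBusinesses <- ifelse(runif(nrow(df)) <= headProb[df$AliasColumn],
                               df$AliasColumn, aliasValues[df$AliasColumn])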
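
To see whether the coin toss reproduces the loaded coin table, one option is to run it on a large made-up column and tabulate the results. The sketch below is only an illustration under stated assumptions: the vector `alias`, the list `outcomes`, and the seed are hypothetical and not part of the original data. It also shows a `sample()`-based variant. In the question's attempt, `sample(..., 1, ...)` draws a single value, which is then recycled across every row of the group; drawing `sum(idx)` values per group avoids that.

```r
set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible

headProb    <- c(8/12, 11/12, 1, 10/12, 5/12)  # P(keep AliasColumn)
aliasValues <- c(2, 3, NA, 3, 1)               # value used otherwise

# Hypothetical test column: 100,000 rows for each AliasColumn value
alias <- rep(1:5, each = 1e5)

# Vectorised coin toss, one uniform draw per row
vectorised <- ifelse(runif(length(alias)) <= headProb[alias],
                     alias, aliasValues[alias])

# Alternative: call sample() once per group, drawing as many values as the
# group has rows (size = 1 would give one value recycled over the group)
outcomes <- list(c(1, 2), c(2, 3), c(3, 3), c(4, 3), c(5, 1))
grouped <- numeric(length(alias))
for (k in 1:5) {
  idx <- alias == k
  grouped[idx] <- sample(outcomes[[k]], sum(idx), replace = TRUE,
                         prob = c(headProb[k], 1 - headProb[k]))
}

# Row-wise proportions should be close to the loaded coin table
round(prop.table(table(alias, vectorised), margin = 1), 3)
round(prop.table(table(alias, grouped), margin = 1), 3)
```

The `NA` alias for value 3 is never selected, because `runif()` does not return exactly 1 and so the test against a head probability of 1 is always true. The `ifelse()` version needs no loop, which is probably the simpler choice for a 64-row data frame.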