Search code examples
rsequencestatistics-bootstrap

R - implement bootstrap function for paired (mis)matches


I have trouble implementing a function in the boot library.

The function I want to implement is the following

fsyn = function(x) sum( x[1,] == x [2,] )

It's a count of the matches between two sequences.

My data is a set of sequences such as

   id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   1  c  a  c  b  c  c  b  d  d   a
2   1  c  d  a  a  c  b  d  a  b   a
3   2  b  d  c  b  b  b  c  d  a   b
4   2  b  a  b  c  b  c  d  b  a   d

Something important about these sequences is the fact that they are paired by id.

I am interested in doing two things, the first is to bootstrap the number of matches by id and the second for two random individuals.

The first procedure can be implemented by

library(dplyr) 

chid = df$id
# sampling paired sequences # 
wchid = function(chid) which(chid %in% sample(chid, 1))
# the matches function # 
fsyn = function(x) sum( x[1,] == x [2,] ) 
# wrapping the function # 
funcHamC = function(df) df[wchid(chid), -1] %>% fsyn 

with

df %>% funcHamC

The second function can simply be written like

funcHamR =  function(df) df[sample(df$id, 2), -1] %>% fsyn
df %>% funcHamR

However, I have issues using these two functions with boot.

library(boot)
boot(df, funcHamC, R = 10)
boot(df, funcHamR, R = 10)

This isn't working. Any idea ?

data

df = as.data.frame( t(replicate(20, sample(letters[1:4], 10, T))) ) 
df$id = rep(1:10, 2)
df = df %>% select(id, everything()) %>% arrange(id)

Solution

  • The boot function expects two arguments to the statistic function—the second being a parameter to specify the sample values to choose. Because you are using your own methods to choose randomly from the data, you should set the sim argument to 'parametric'. This will use the ran.gen argument to specify a function to generate random values from the data.

    To quote from the help file: "If ran.gen is not specified, the default is a function which returns the original data in which case all simulation should be included as part of statistic."

    Save the boot output to variables—such as C.boot and R.boot and you will find the samples in C.boot$t and R.boot$t.

    C.boot <- boot(df, statistic=funcHamC, R = 10, sim='parametric')
    R.boot <- boot(df, statistic=funcHamR, R = 10, sim='parametric')
    

    You can then obtain the stats you need from the generated values.