Search code examples
rbioinformaticsgenome

Block bootstrap for genomic data


I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.

My data.frame has the following structure:

CHR POS var_A var_B
1 192 0.9 0.7
1 2000  0.8 0.3
2 3 0.21  0.76 
2 30009 0.36  0.15
...

The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that each row is not entirely independent to one another, depending on the distance between them (the closer the more dependent), and so I cannot simply do cor(df$var_A, df$var_B).

The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.

I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.

x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
  df=freq[0,]
  for (i in unique(freq$CHR)) {
    t = freq[freq$CHR==i,]
    fim = t[nrow(t),2]
    i = t[1,2]
    f = i + x
    while(f < fim) {
      rows = which(t$POS>=i & t$POS<f)
      s = sample(rows)
      df = rbind(df,t[s,])
      i = f
      f = f + x
    }
  }
  cors = c(cors, cor(df$var_A, df$var_B))
}

Could anybody help me out? I am sure there is a more efficient way of doing this.

Thank you in advance.


Solution

  • So, after a while I came up with an answer to my problem. Here it goes.

    You'll need the package dplyr.

    l = 1000
    teste = freq %>%
      mutate(w = ceiling(POS/l)) %>%
      group_by(CHR, w) %>%
      sample_n(1)
    

    This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.

    You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.