Search code examples
rrangeintervalscut

R: Dividing vector into intervals, and test which integer falls into what interval


I have 23 chromosomes and their lengths

chromosome    length
1             249250621
2             243199373
3             198022430
4             191154276
5             180915260
6             171115067 
..            .........
Y             59373566

For each chromosome, I want to create 5000 bins/intervals of equal size.

Chr1:
bin_number    start        end
1             1            49850
2             49851        99700
....          .....        .....
5000          249200771    249250621

I have tried using both "cut" and "cut2" for this purpose. "cut2" cannot handle the lenght of the chromosomes and crashes, while cut provides one with intervals for each individual place (249250621 intervals!).

cut2(1:249250621, g=5000, onlycuts = TRUE)

cut(1:249250621, breaks=5000)

When I have the the intervals, I want to assign which bin/interval 50.000 variants each fall within.

My data (Chromosome 1):

variant            chromosome    position
1:20000_G/A        1             20000
1:30000_C/CCCCT    1             30000
1:60000_G/T        1             60000
..............     ..            .......

What I want:

variant            chromosome    position    bin_number
1:20000_G/A        1             20000       1
1:30000_C/CCCCT    1             30000       1
1:60000_G/T        1             60000       2
..............     ..            .......     ...

I would appreciate any suggestions for methods that are relevant for splitting my chromosomes into intervals. When I have the intervals, I need methods that quickly can test which interval the variant belongs to.


Solution

  • Thanks for your input. I have chosen to create the intervals using a simple loop, to ensure that the intervals were of the desired size.

    I created a data.frame with sizes of chromosomes

    chrSizes <- data.frame(chromosome = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"), length = c("249250621","243199373", "198022430", "191154276", "180915260", "171115067", "159138663", "146364022", "141213431", "135534747", "135006516", "133851895", "115169878", "107349540", "102531392", "90354753", "81195210", "78077248", "59128983", "63025520", "48129895", "51304566", "155270560", "59373566"), stringsAsFactors = FALSE)
    

    Then I looped through each chromosome creating the intervals, by finding the precise chunk size and then rounding down. The remainder then was used to add one to the first many intervals.

    numOfBins <- 10000
    chrBinList <- list()
    for (i in 1:24) {
      chrBins <- c()
      chrLength <- as.numeric(chrSizes[i,2])
      chunkSize <- floor(chrLength/numOfBins)
      remainder <- chrLength %% chunkSize
      counter <- 1
    
      # Adding remainder to the first intervals
      for (j in 1:(remainder-1)) {
        chrBins <- c(chrBins, counter)
        counter <- counter + chunkSize + 1
        chrBins <- c(chrBins, counter)
      }
    
      # Adding normal sized chunks to remaining intervals
      for (k in remainder:numOfBins) {
        chrBins <- c(chrBins, counter)
        counter <- counter + chunkSize
        chrBins <- c(chrBins, counter)
      }
    
      # Creating a data.frame with intervals
      interval.df <- as.data.frame(matrix(chrBins,ncol = 2, byrow = TRUE))
      colnames(interval.df) <- c("start", "end")
    
      # Adding to list
      chrBinList[[chrSizes[i,1]]] <- interval.df
    }
    

    As for testing whether values fall within the different bins, I have come up with a slow solution using apply. I am, however, currently look into the parallel apply functions.