I have 23 chromosomes and their lengths
chromosome length
1 249250621
2 243199373
3 198022430
4 191154276
5 180915260
6 171115067
.. .........
Y 59373566
For each chromosome, I want to create 5000 bins/intervals of equal size.
Chr1:
bin_number start end
1 1 49850
2 49851 99700
.... ..... .....
5000 249200771 249250621
I have tried using both "cut" and "cut2" for this purpose. "cut2" cannot handle the lenght of the chromosomes and crashes, while cut provides one with intervals for each individual place (249250621 intervals!).
cut2(1:249250621, g=5000, onlycuts = TRUE)
cut(1:249250621, breaks=5000)
When I have the the intervals, I want to assign which bin/interval 50.000 variants each fall within.
My data (Chromosome 1):
variant chromosome position
1:20000_G/A 1 20000
1:30000_C/CCCCT 1 30000
1:60000_G/T 1 60000
.............. .. .......
What I want:
variant chromosome position bin_number
1:20000_G/A 1 20000 1
1:30000_C/CCCCT 1 30000 1
1:60000_G/T 1 60000 2
.............. .. ....... ...
I would appreciate any suggestions for methods that are relevant for splitting my chromosomes into intervals. When I have the intervals, I need methods that quickly can test which interval the variant belongs to.
Thanks for your input. I have chosen to create the intervals using a simple loop, to ensure that the intervals were of the desired size.
I created a data.frame with sizes of chromosomes
chrSizes <- data.frame(chromosome = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"), length = c("249250621","243199373", "198022430", "191154276", "180915260", "171115067", "159138663", "146364022", "141213431", "135534747", "135006516", "133851895", "115169878", "107349540", "102531392", "90354753", "81195210", "78077248", "59128983", "63025520", "48129895", "51304566", "155270560", "59373566"), stringsAsFactors = FALSE)
Then I looped through each chromosome creating the intervals, by finding the precise chunk size and then rounding down. The remainder then was used to add one to the first many intervals.
numOfBins <- 10000
chrBinList <- list()
for (i in 1:24) {
chrBins <- c()
chrLength <- as.numeric(chrSizes[i,2])
chunkSize <- floor(chrLength/numOfBins)
remainder <- chrLength %% chunkSize
counter <- 1
# Adding remainder to the first intervals
for (j in 1:(remainder-1)) {
chrBins <- c(chrBins, counter)
counter <- counter + chunkSize + 1
chrBins <- c(chrBins, counter)
}
# Adding normal sized chunks to remaining intervals
for (k in remainder:numOfBins) {
chrBins <- c(chrBins, counter)
counter <- counter + chunkSize
chrBins <- c(chrBins, counter)
}
# Creating a data.frame with intervals
interval.df <- as.data.frame(matrix(chrBins,ncol = 2, byrow = TRUE))
colnames(interval.df) <- c("start", "end")
# Adding to list
chrBinList[[chrSizes[i,1]]] <- interval.df
}
As for testing whether values fall within the different bins, I have come up with a slow solution using apply. I am, however, currently look into the parallel apply functions.