Search code examples
rcountbinning

Make bins in 100000 intervals and count occurrences with R


I'm working with genomic data in this format:

chr    start    end   lengthabs_summit  pileup  X.log10.pvalue. fold_enrichment X.log10.qvalue. name
chr1    29017   29389   373 29358   31  28.59002    11.7551 24.95703    7_peak_2
chr1    569886  569978  93  569924  1334    334.59555   3.66639 329.13641   7_peak_13
chr1    713775  714591  817 714238  63  57.55214    14.98049    53.18887    7_peak_16
chr1    1009170 1009766 597 1009354 57  29.6026 6.49704 25.93788    7_peak_38
chr1    1013682 1014753 1072    1014285 45  22.68048    6.00323 19.24049    7_peak_39
chr1    1051283 1052033 751 1051691 49  34.32018    9.31181 30.51424    7_peak_43
chr1    1071957 1072489 533 1072064 36  20.45083    6.56582 17.09022    7_peak_46
chr1    1079500 1080408 909 1079994 36  21.25546    6.87813 17.8657 7_peak_47
chr1    1085553 1085793 241 1085681 32  20.59002    7.39226 17.22433    7_peak_48
chr1    1092859 1093875 1017    1092953 55  32.86424    7.69885 29.10045    7_peak_49
chr1    1098076 1098442 367 1098157 51  25.19468    6.00704 21.67023    7_peak_50
chr1    1167340 1167771 432 1167457 46  34.94157    10.2791 31.11741    7_peak_57
chr1    1310568 1311013 446 1310739 75  61.06957    12.93319    56.63967    7_peak_73
chr1    1334658 1335005 348 1334903 41  32.4828 10.54771    28.73031    7_peak_74
chr1    1368673 1368922 250 1368819 39  20.83713    6.22806 17.46213    7_peak_77
chr1    1407006 1407170 165 1407136 29  23.68931    9.70474 20.21472    7_peak_81
chr1    1446997 1447660 664 1447477 35  25.84261    9.0858  22.29687    7_peak_83
chr1    1550552 1551647 1096    1550765 42  27.55648    8.18824 23.95619    7_peak_87
chr1    1562564 1563038 475 1562809 45  27.52078    7.59892 23.92145    7_peak_88
chr1    1623807 1625030 1224    1624276 59  40.35566    9.39971 36.38159    7_peak_96
chr1    1655573 1656140 568 1655902 44  38.03923    12.27166    34.12801    7_peak_98
chr1    1677697 1678421 725 1677814 46  30.71495    8.58012 27.01606    7_peak_101
chr1    1690209 1690798 590 1690462 55  37.97549    9.38048 34.06614    7_peak_102
chr1    1850605 1851273 669 1850915 58  30.82379    6.7014  27.12157    7_peak_108
chr1    1981599 1982178 580 1981750 44  29.74246    8.62567 26.07388    7_peak_109
chr1    2121014 2121503 490 2121181 44  25.97852    7.22808 22.42829    7_peak_115
chr1    2130779 2131029 251 2130922 57  30.68925    6.78891 26.99122    7_peak_118
chr1    2158733 2159503 771 2159309 52  35.02846    8.9443  31.2017 7_peak_123
chr1    2322758 2323284 527 2323118 47  34.27391    9.75263 30.46929    7_peak_129
chr1    2343877 2344464 588 2344122 45  23.81217    6.35414 20.33326    7_peak_131
chr1    2457479 2458104 626 2457738 41  27.63569    8.43239 24.03328    7_peak_136
chr1    2507171 2507610 440 2507387 40  22.07389    6.50842 18.65457    7_peak_141
chr1    2517776 2518527 752 2517982 79  54.66531    10.1156 50.35995    7_peak_144
chr1    3104749 3105340 592 3105042 39  26.23199    8.29302 22.67383    7_peak_168
chr1    3339907 3340297 391 3340051 61  47.4887 11.4835 43.33681    7_peak_183
chr1    3541145 3541844 700 3541432 33  22.2239 7.90376 18.79962    7_peak_194
chr1    3712982 3713209 228 3713146 25  21.03679    9.46547 17.65467    7_peak_204
chr1    3773318 3774375 1058    3773903 71  64.18323    15.20667    59.69656    7_peak_206
chr1    3816748 3818236 1489    3817402 58  40.40163    9.61359 36.42624    7_peak_210
chr1    6052087 6052758 672 6052606 55  44.57815    11.90162    40.49594    7_peak_218
chr1    6086130 6086460 331 6086283 26  21.8022 9.58904 18.39271    7_peak_220
chr1    6259449 6259894 446 6259711 48  42.85861    13.27342    38.81911    7_peak_223
chr1    6453259 6454267 1009    6453833 36  25.9895 8.89626 22.43882    7_peak_236
chr1    6639866 6640271 406 6640031 44  35.75193    11.19049    31.90473    7_peak_243
chr1    6673060 6674146 1087    6673629 61  46.69005    11.1878 42.55659    7_peak_248
chr1    6844434 6845552 1119    6845378 72  58.2036 12.66662    53.8277 7_peak_252
chr1    6882651 6882812 162 6882746 21  22.99598    11.9154 19.54511    7_peak_255
chr1    7325838 7326444 607 7326032 32  24.22423    9.10923 20.73225    7_peak_258
chr1    7338199 7338451 253 7338410 23  20.28285    9.65393 16.92857    7_peak_259
chr1    7843899 7844833 935 7844068 50  38.84025    10.87309    34.90662    7_peak_266
chr1    7945594 7945913 320 7945805 37  40.12659    16.04772    36.15866    7_peak_267
chr1    8013883 8014418 536 8014328 29  24.7682 10.29467    21.25742    7_peak_269
chr1    8021299 8021991 693 8021619 78  76.90004    18.15693    72.21448    7_peak_270
chr1    8763179 8763705 527 8763447 45  41.29927    13.54395    37.30036    7_peak_297
chr1    8877609 8877845 237 8877792 24  20.69754    9.58204 17.32788    7_peak_299
chr1    9222907 9223400 494 9223017 44  30.50605    8.92885 26.81356    7_peak_310
chr1    9294465 9295131 667 9294997 34  23.79729    8.38562 20.31876    7_peak_316
chr1    9488859 9489215 357 9489096 33  35.37181    14.91643    31.53497    7_peak_323
chr1    9599244 9600007 764 9599346 38  30.08358    10.27689    26.40452    7_peak_325

Where the important columns are chr, start and end. For each chromosome, I want to make bins every 100kb, bin each row into one bin depending on the start position and then count the number of occurrences in each bin to compare the distribution between samples.

I'm having trouble defining the bins. I've seen that "cut" is very used for this, but since I don't have defined cutting points and it varies in each chromosome I'm not sure is the appropriate command.

bin_size = 100000
for (x in levels(df$chr)) { # For each chromosome
  number_groups = max(df$end)/bin_size # Number of bins
    # How to use cut here?         
}

Solution

  • You can use aggregate for this. Using the data from thisisg:

    aggregate(end ~ chr + start%/%100000, data=test, FUN=length)
    ##     chr start%/%1e+05 end
    ## 1  chr1             0   1
    ## 2  chr1             5   1
    ## 3  chr1             7   1
    ## 4  chr1            10   8
    ...
    

    The names then can be changed in the result. end is the count here, as that is the name on the left side of the ~ in the formula. Any column would do, as we're simply counting the number of elements with length.