bash, bioinformatics, imputation, genome

Convert .gprobs files from Impute2 to PLINK format


I have some imputed .gprobs files (one per chromosome), produced by Impute2 and downloaded from dbGaP, and I need to convert them into PLINK's .bed format in order to run some analyses.

My .gprobs files look like:

--- rs371609562:61395:CTT:C 61395 CTT C 0 0.023 0.977 0 0.039 0.961 0 0.015 0.985 0 0.026 0.974 0 0 1 0 0 1 0 0 1

Could someone help me find out how to convert this kind of file into PLINK format, or tell me which files I need to perform the conversion?

P.S.: I know this question may not belong here, but I didn't know where else to ask.


Solution

  • By .gprobs, it appears you mean the Oxford (.gen) format; see:

    https://www.cog-genomics.org/plink/1.9/formats#gen

    If this is correct, then PLINK can read this format directly, as described here:

    https://www.cog-genomics.org/plink/1.9/input#oxford

    In the same command you can output to PLINK binary format:

    plink --gen file.gen --sample file.sample --make-bed --out output_prefix
    

    Note the following caveat regarding conversion from Oxford to PLINK format:

    Since the PLINK 1 binary format cannot represent genotype probabilities, calls with uncertainty greater than 0.1 are normally treated as missing, and the rest are treated as hard calls. You can adjust this threshold by providing a numeric parameter to --hard-call-threshold.

    Alternatively, when --hard-call-threshold is given the 'random' modifier, calls are independently randomized according to the probabilities in the file. (This is not ideal; it would be better to randomize in a haploblock-sensitive manner. But resampling a bunch of times with this and generating an empirical distribution of some statistic can still be more informative than applying a single threshold and calculating that statistic once.)

    Source: https://www.cog-genomics.org/plink/1.9/input#oxford
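
Since the question mentions one .gprobs file per chromosome, and the first column of the example line is "---" rather than a chromosome code, the command above may need to be run once per file with the chromosome supplied explicitly. The following is a minimal sketch, not a tested pipeline: the file names chr<N>.gprobs and study.sample and the output prefix chr<N> are assumptions, as is the restriction to autosomes 1 to 22. It relies on PLINK 1.9's --oxford-single-chr flag (documented on the same input page) to set the chromosome code, and passes --hard-call-threshold explicitly so the cutoff from the caveat above is visible and easy to change. By default it only prints the commands so they can be checked first.

```shell
#!/usr/bin/env bash
# Sketch: convert per-chromosome IMPUTE2 .gprobs files to PLINK binary filesets.
# Assumed (hypothetical) names: chr<N>.gprobs, study.sample, output prefix chr<N>.
set -euo pipefail

PLINK=${PLINK:-plink}            # PLINK 1.9 executable
SAMPLE=${SAMPLE:-study.sample}   # Oxford-format sample file matching the .gprobs columns
THRESHOLD=${THRESHOLD:-0.1}      # 0.1 is the default uncertainty cutoff quoted above
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the commands, 0 = actually run them

# Build the PLINK command line for one chromosome.
plink_cmd() {
    local chr=$1
    # --oxford-single-chr supplies the chromosome code, because the first
    # column of these files ("---") does not contain it.
    printf '%s --gen chr%s.gprobs --sample %s --oxford-single-chr %s --hard-call-threshold %s --make-bed --out chr%s\n' \
        "$PLINK" "$chr" "$SAMPLE" "$chr" "$THRESHOLD" "$chr"
}

for chr in $(seq 1 22); do
    cmd=$(plink_cmd "$chr")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd   # word splitting is intended; assumes no spaces in file names
    fi
done
```

Running the script with DRY_RUN=0 executes the conversions and should leave one .bed/.bim/.fam set per chromosome. If a single fileset is needed afterwards, PLINK's --merge-list option is one way to combine them.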