Search code examples
bashawkvcf-vcardfastqsequencing

Binning Together Allele Frequencies From VCF Sequencing Data


I have a sequencing datafile containing base pair locations from the genome, that looks like the following example:

chr1 814 G A 0.5
chr1 815 T A 0.3
chr1 816 C G 0.2
chr2 315 A T 0.3
chr2 319 T C 0.8
chr2 340 G C 0.3
chr4 514 A G 0.5

I would like to compare certain groups defined by the location of the bp found in column 2. I then want the average of the numbers in column 5 of the matching regions.

So, using the example above lets say I am looking for the average of the 5th column for all samples spanning chr1 810-820 and chr2 310-330. The first five rows should be identified, and their 5th column numbers should be averaged, which equals 0.42.

I tried creating an array of ranges and then using awk to call these locations, but have been unsuccessful. Thanks in advance.


Solution

  • import pandas as pd
    from StringIO import StringIO
    
    s = """chr1 814 G A 0.5
    chr1 815 T A 0.3
    chr1 816 C G 0.2
    chr2 315 A T 0.3
    chr2 319 T C 0.8
    chr2 340 G C 0.3
    chr4 514 A G 0.5"""
    
    sio = StringIO(s)
    df = pd.read_table(sio, sep=" ", header=None)
    df.columns=["a", "b", "c", "d", "e"]
    
    # The query expression is intuitive 
    r = df.query("(a=='chr1' & 810<b<820) | (a=='chr2' & 310<b<330)")
    print r["e"].mean()
    

    pandas might be better for such tabular data processing, and it's python.