Search code examples
pandasbioinformaticssamtools

How to most efficiently retrieve data from NarrowPeak (BED6+4) format files?


I am working on a bioinformatics project that involves very big NarrowPeak formated files that look like this:

(the columns are 'chrom ,chromStart, chromEnd, name, score ,strand, signalValue ,pValue, qValue ,peak')


chr1    713835  714424  chr1.1  1000    .   0.1621  10.6    -1  253
chr1    752775  753050  chr1.2  567     .   0.0365  2.09    -1  124
chr1    753373  753448  chr1.3  511     .   0.0243  1.31    -1  27
chr1    762057  763210  chr1.4  1000    .   0.1743  11.5    -1  846
chr1    793358  793612  chr1.5  574     .   0.0379  2.18    -1  121
chr1    805101  805538  chr1.6  783     .   0.0834  5.23    -1  200
chr1    839626  840461  chr1.7  1000    .   0.2079  13.8    -1  510
chr1    842125  842534  chr1.8  641     .   0.0526  3.15    -1  199
chr2    846334  846381  chr1.9  510     .   0.0241   1.3    -1  17
chr2    846545  846937  chr1.10 562     .   0.0355  2.03    -1  198
chr2    848219  848588  chr1.11 605     .   0.0448  2.64    -1  179
chr2    851784  852887  chr1.12 734     .   0.0728  4.51    -1  323

I have a way of indexing the file (using tabix in samtools) and querying a row by entering the chr number and range and getting the right row (I am using it because my teammate says it is the fastest way).

for example if my query is chr1:713835-714424 I will get the result:

chr1    713835  714424  chr1.1  1000    .   0.1621  10.6    -1  253

I want to be able to enter the chr num and range and get the score 253 alone.

I did it by turning the result to a data frame (with pandas).

def turn_query_results_into_df(query_results):
    replaced=(query_results.replace('\t',','))
    stripped = (replaced.strip())
    lines = (replaced.split("\n") )
    data=[]
    for line in lines:
        data.append((line.split(',')))
    mydf1=pd.DataFrame(data,columns=('chrom chromStart chromEnd name score strand signalValue pValue qValue peak'.split()))
    mydf1=mydf1.drop(mydf1.tail(1).index)
    mydf1=mydf1.astype({'chromStart':'int32','chromEnd':'int32','score':'int32','signalValue':'float64','pValue':'float64','peak':'float64'})
    return mydf1

print(peak_val = df1['peak'].values)

Is there a better way?

If there is a way using samtools, it would be the best, but I am having a hard time understanding it.

Thanks


Solution

  • Pandas is serious overkill. If you are using tabix for your querying too, with a command sequence like:

    $ tabix input.bed.gz # index the input
    $ tabix input.bed.gz chr1:713835-714424 # query the input
    chr1    713835  714424  chr1.1  1000    .   0.1621  10.6    -1  253
    

    You can just pipe the output of your tabix query to the Unix cut utility to select only the 10th column:

    $ tabix input.bed.gz chr1:713835-714424 | cut -f 10
    253