Search code examples
pythonparsingannotationssequencebed

ChIP-seq Analysis: Extracting Sequences and Generating Bed File


I'm Logan who study Bioinformatics.

I have analyzed ChIP-seq data by conducting peak calling with MACS3 and annotating the peaks using HOMER. The output provides coordinates for each peak, such as those for the region CM025010.1 (Start: 40194025, End: 40194493) mentioned below. Now, my objective is to extract sequences within a ±1kb window around each peak position and subsequently generate a bed file. I would appreciate any ideas or suggestions you may have. Thank you.

PeakID (cmd=annotatePeaks.pl refGenome.fa)  Chr Start   End Strand  Peak Score  Focus Ratio/Region Size Annotation  Detailed
peakCalling_q_0.01_peak_16  CM025010.1  40194025    40194493    +   311 NA  NA  NA  NA  NA
peakCalling_q_0.01_peak_50  CM025021.1  12981866    12982368    +   279 NA  NA  NA  NA  NA
peakCalling_q_0.01_peak_27  CM025012.1  6509890 6510225 +   266 NA  NA  NA  NA  NA
peakCalling_q_0.01_peak_45  CM025021.1  12517853    12518147    +   246 NA  NA  NA  NA  NA

Solution

  • import pandas as pd
    
    df = pd.read_csv('original_peaks.bed', sep='\t', header=None, names=['your', 'headers', 'here', 'a', 'b', 'etc'])
    
    window_size = 1000
    
    df['Start'] = df['Start'] - window_size
    df['End'] = df['End'] + window_size
    
    #### Clip the extended regions to ensure they don't fall below 0
    df['start'] = df['start'].clip(lower=0)
    
    #### Save the extended regions to a new BED file
    df.to_csv('extended_peaks.bed', sep='\t', header=False, index=False)