Search code examples
pythonunixawkgrepfastq

find unique first top and bottom lines of fastq file from fasta file


I have 2 files one is fasta file and other one is fastq file. I want to take the fasta, read each line and search each line in the fastq file and print top line and bottom lines. This is what I have

fasta file

read1

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG

for seq in `cat sequences`;do grep -A 2 -B 1 $seq FAP.1.txt;done

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

+

CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############

@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

+ ;@?DBD<@@FFHHH<

@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

+

@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85

@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG

@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########

From this we can see that AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA appeared twice, but I want to print only once. how can I do that?

Final output file

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

+

CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############

@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

+

;@?DBD<@@FFHHH<

@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG

+

@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B


Solution

  • Unless you have a strong reason to do this yourself, use Biopython.

    fasta:

    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG
    

    fastq (based on yours but not identical because your output was badly formatted):

    @DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
    +
    CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
    @DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    +
    CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
    @DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    +
    @@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
    @DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
    +
    @CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########
    

    Code:

    from Bio import SeqIO
    
    with open("fasta") as fh:
        fasta = fh.read().splitlines()
    
    seen = set()
    
    for record in SeqIO.parse(open('fastq'), 'fastq'):
        seq = str(record.seq)
        if seq in fasta and seq not in seen:
            seen.add(seq)
            print record.format('fastq')
    

    EDIT: The above prints records in the order of the fastq file, not the fasta file. If order is not important, you should use that method. Otherwise, you can add the records to a dictionary where the key is their index in the FASTA file, and print them all in the end, sorting the dictionary:

    from Bio import SeqIO
    import sys
    
    with open("fasta") as fh:
        fasta = fh.read().splitlines()
    
    seen = set()
    records = {}
    
    for record in SeqIO.parse(open('fastq'), 'fastq'):
        seq = str(record.seq)
        if seq in fasta and seq not in seen:
            seen.add(seq)
            records[fasta.index(seq)] = record
    
    for record in sorted(records):
        sys.stdout.write(records[record].format('fastq'))
    

    (Here I also use sys.stdout.write instead of print, to avoid the extra newlines.)