python, python-2.7, fastq

Read a FASTQ file into a dictionary


I have a fastq file like this (part of the file):

@A80HNBABXX:4:1:1344:2224#0/1
AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG  
+
\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  
@A80HNBABXX:4:1:1515:2211#0/1
TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA  
+  
ee^e^\`ad`eeee\dd\ddddYeebdd\ddaYbdcYc`\bac^YX[V^\Ybb]]^bdbaZ]ZZ\^K\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_  
@A80HNBABXX:4:1:1538:2220#0/1
CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT  
+
fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y  
@A80HNBABXX:4:1:1666:2222#0/1
CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT  
+
deeee`bbcddddad\bbbbeee\ecYZcc^dd^ddd\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB

A FASTQ file uses four lines per sequence. Line 1 begins with a '@' character followed by a sequence identifier. Line 2 contains the DNA sequence letters. Line 3 begins with a '+' character. Line 4 encodes the quality values for the sequence in line 2 (the part after the "+" and before the next "@") and must contain the same number of symbols as there are letters in the sequence.

I want to read the FASTQ file into a dictionary like this (the key is the DNA sequence and the value is the quality string; the lines starting with "@" and "+" can be discarded):

{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG':'\YYWX\PX^YT[TVYaTY]^\^H`^a\UZU__TTbSbb^\a^^^[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
 'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT':'fff^fd\c^d^Ycacdcdcdedeffdfedb]beeeeecd^ddccdddddfffeaeeeffdTecacaLV[QRPa\a`]aY]ZZ[XYcccYcZ\]Y ',
    ....}

I wrote the following code, but it does not give me what I want. Can anyone help me fix or improve it?

class fastq(object):
    def __init__(self,filename):
        self.filename = filename
        self.__sequences = {}

    def parse_file(self):
        symbol=['@','+']
        """Stores both the sequence and the quality values for the sequence"""
        f = open(self.filename,'rU')
        for lines in self.filename:
            if symbol not in lines.startwith()
            data = f.readlines()
        return data

Solution

  • Here's a quick and fairly efficient way of doing it:

    def parse_file(self):
        with open(self.filename, 'r') as f:
            # Strip newlines and trailing spaces, and skip blank lines
            content = [line.strip() for line in f if line.strip()]

        # Recreate content without lines that start with @ and +
        content = [line for line in content if line[0] not in '@+']

        # Now the lines you want are alternating, so you can make a dict
        # from key/value pairs of lists content[0::2] and content[1::2]
        data = dict(zip(content[0::2], content[1::2]))

        return data
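
    One caveat with filtering on the first character: a FASTQ quality string can itself begin with '@' or '+', since both are valid quality symbols, and such a line would be dropped along with the headers. A minimal sketch of a position-based alternative follows. It relies only on the four-lines-per-record layout described in the question. The function name `parse_fastq` and its error messages are illustrative assumptions, not part of the original answer, and it assumes sequences are not wrapped across several lines.

```python
def parse_fastq(path):
    """Return {sequence: quality} for a four-line-per-record FASTQ file.

    Hypothetical helper for illustration; assumes each record is exactly
    four lines (header, sequence, separator, quality) with no wrapping.
    """
    with open(path, 'r') as handle:
        # Keep non-blank lines only, without newlines or trailing spaces
        lines = [line.strip() for line in handle if line.strip()]

    if len(lines) % 4 != 0:
        raise ValueError('expected a multiple of 4 non-blank lines')

    data = {}
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        record_number = i // 4 + 1
        # Identify lines by position, then sanity-check their markers
        if not header.startswith('@') or not separator.startswith('+'):
            raise ValueError('malformed record %d' % record_number)
        if len(sequence) != len(quality):
            raise ValueError('length mismatch in record %d' % record_number)
        data[sequence] = quality
    return data
```

    Because the sequence is the dictionary key, two reads with identical sequences collapse into one entry and the later quality string wins. If every read needs to be kept, a list of (sequence, quality) tuples would be a safer container.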