
Python dictionary: make every odd-numbered line a key and every even-numbered line a value, from a file


Hi, I have a text file like this:

>NM_145914.2:212
TCTGATGGTAAAAGTCGAGGAGAAAGAAGA
>NM_000614.3:1086
ATTCAATTTAAAATCAGACTCTTTAGTTGA
>NM_012096.2:2808
CAGTTAAGGTTTCAAATTGTGGCAGGTGGT
>NM_173465.3:1682
GTGCGTCGGGTGAGAGAGGCCCCAGCGGCC
>NM_001198858.1:490
CAACCACCACAACCTGCTGGTCTGCTCGGT
......more lines in same style......

What I want is:

Read the file above and make lines 1, 3, 5, 7, ... the dictionary keys and lines 2, 4, 6, 8, ... the dictionary values.

My code is:

import linecache

# filename and totalLines are assumed to be defined earlier
query_dict = {}
nameAt = 1
sequenceAt = 2

while sequenceAt <= totalLines:
    line1 = linecache.getline(filename, nameAt)
    line2 = linecache.getline(filename, sequenceAt)

    query_dict[line1] = line2
    nameAt = nameAt + 2
    sequenceAt = sequenceAt + 2

The code works, but it's very slow, since my text files have at least 200,000 lines. Does anyone have a better method to do this?

Thanks very much.

==============added follow-up question==================

Here is the FASTQ format, with 4 lines per read (record):

@>NM_052972.2:11:1054:1780:889
CTTCGACATCTCCGGCAACCCCTGGATCTG
+>NM_052972.2:11:1054:1780:889
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@>NM_080660.3:12:914:1802:542
CCTGTATGGCTACTGCAACCTCAAGGATAA
+>NM_080660.3:12:914:1802:542
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@>NM_176814.3:712:2706:4242:98
ACAGAGTAAAAGAGAGGCTGACTTAATAAA
+>NM_176814.3:712:2706:4242:98
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
...... more lines in same style ......

I want to create a dictionary where the key is the 1st line and the value is the 2nd line of each 4-line record.

The dictionary would look like:

{'@>NM_052972.2:11:1054:1780:889':'CTTCGACATCTCCGGCAACCCCTGGATCTG', 
 '@>NM_080660.3:12:914:1802:542':'CCTGTATGGCTACTGCAACCTCAAGGATAA',
 '@>NM_176814.3:712:2706:4242:98':'ACAGAGTAAAAGAGAGGCTGACTTAATAAA',
 ..... more keys and values ......
}

thanks.


Solution

  • Something like this, where next(f) consumes the line that follows each key line:

    with open('filename') as f:
        query_dict = {line.strip():next(f).strip() for line in f}
    

    Output:

    >>> from pprint import pprint
    >>> pprint(query_dict)
    {'>NM_000614.3:1086': 'ATTCAATTTAAAATCAGACTCTTTAGTTGA',
     '>NM_001198858.1:490': 'CAACCACCACAACCTGCTGGTCTGCTCGGT',
     '>NM_012096.2:2808': 'CAGTTAAGGTTTCAAATTGTGGCAGGTGGT',
     '>NM_145914.2:212': 'TCTGATGGTAAAAGTCGAGGAGAAAGAAGA',
     '>NM_173465.3:1682': 'GTGCGTCGGGTGAGAGAGGCCCCAGCGGCC'}
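
The same pairing can also be sketched with zip over a single iterator, which avoids calling next inside the comprehension. This is only an illustration: pair_lines is a hypothetical helper name, and io.StringIO stands in for the open file so the snippet runs on its own. Passing an open file object instead should behave the same way.

```python
import io

def pair_lines(lines):
    """Map line 1 -> line 2, line 3 -> line 4, ... (hypothetical helper)."""
    it = iter(lines)
    # zip(it, it) draws two consecutive items from the same iterator per step;
    # a trailing unpaired line is silently dropped.
    return {name.strip(): seq.strip() for name, seq in zip(it, it)}

# Stand-in for an open file, using the first two records from the question.
sample = io.StringIO(
    ">NM_145914.2:212\n"
    "TCTGATGGTAAAAGTCGAGGAGAAAGAAGA\n"
    ">NM_000614.3:1086\n"
    "ATTCAATTTAAAATCAGACTCTTTAGTTGA\n"
)
query_dict = pair_lines(sample)
```

Unlike the comprehension with next(f), this version does not raise StopIteration when the file has an odd number of lines; whether dropping the last line silently is acceptable depends on the data.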
    

    Update (for the 4-line FASTQ records in the follow-up question):

    from pprint import pprint

    with open('foo.txt') as f:
        dic = {}
        for line in f:
            dic[line.strip()] = next(f).strip()  # header line -> sequence line
            next(f)  # drop the '+' line
            next(f)  # drop the quality line
    pprint(dic)
    

    Output:

    {'@>NM_052972.2:11:1054:1780:889': 'CTTCGACATCTCCGGCAACCCCTGGATCTG',
     '@>NM_080660.3:12:914:1802:542': 'CCTGTATGGCTACTGCAACCTCAAGGATAA',
     '@>NM_176814.3:712:2706:4242:98': 'ACAGAGTAAAAGAGAGGCTGACTTAATAAA'}
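
Both cases (2 lines per record and 4 lines per record) could be covered by one function that reads a fixed number of lines at a time with itertools.islice. The following is a rough sketch rather than a tested FASTQ parser: records_to_dict and its record_size parameter are hypothetical names, and io.StringIO again stands in for the open file.

```python
import io
from itertools import islice

def records_to_dict(lines, record_size=4):
    """Use line 1 of each record as key and line 2 as value (hypothetical helper)."""
    it = iter(lines)
    result = {}
    while True:
        # Take the next record_size lines; fewer come back at end of input.
        record = list(islice(it, record_size))
        if len(record) < 2:
            break  # no more data, or a dangling header without a sequence
        result[record[0].strip()] = record[1].strip()
    return result

# Stand-in for an open file, using the first two records from the question.
sample = io.StringIO(
    "@>NM_052972.2:11:1054:1780:889\n"
    "CTTCGACATCTCCGGCAACCCCTGGATCTG\n"
    "+>NM_052972.2:11:1054:1780:889\n"
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@>NM_080660.3:12:914:1802:542\n"
    "CCTGTATGGCTACTGCAACCTCAAGGATAA\n"
    "+>NM_080660.3:12:914:1802:542\n"
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n"
)
dic = records_to_dict(sample)
```

Calling records_to_dict(f, record_size=2) on the first file format should give the same result as the dictionary comprehension above. The sketch assumes every record occupies exactly record_size lines, as in the examples shown in the question.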