python python-itertools text-parsing

Parsing blocks of text data with Python itertools.groupby


I'm trying to parse blocks of text in Python 2.7 using itertools.groupby. The data has the following structure:

BEGIN IONS
TITLE=cmpd01_scan=23
RTINSECONDS=14.605
PEPMASS=694.299987792969 505975.375
CHARGE=2+
615.839727 1760.3752441406
628.788226 2857.6264648438
922.4323436 2458.0959472656
940.4432533 9105.5
END IONS
BEGIN IONS
TITLE=cmpd01_scan=24
RTINSECONDS=25.737
PEPMASS=694.299987792969 505975.375
CHARGE=2+
575.7636234 1891.1656494141
590.3553938 2133.4477539063
615.8339562 2433.4252929688
615.9032114 1784.0628662109
END IONS

I need to extract information from the lines beginning with "TITLE=", "PEPMASS=", and "CHARGE=".

The code I'm using is as follows:

import itertools
import re

data_file='Test.mgf'
def isa_group_separator(line):
    return line=='END IONS\n'

regex_scan = re.compile(r'TITLE=')
regex_precmass=re.compile(r'PEPMASS=')
regex_charge=re.compile(r'CHARGE=')


with open(data_file) as f:
    for (key,group) in itertools.groupby(f,isa_group_separator):
        #print(key,list(group)) 
        if not key:
            precmass_match = filter(regex_precmass.search,group)
            print precmass_match            

            scan_match= filter(regex_scan.search,group)
            print scan_match

            charge_match = filter(regex_charge.search,group)
            print charge_match 

However, the output only picks up the "PEPMASS=" line. If the scan_match assignment is moved before precmass_match, only the "TITLE=" line is printed instead:

> ['PEPMASS=694.299987792969 505975.375\n'] [] []
> ['PEPMASS=694.299987792969 505975.375\n'] [] []

Can someone point out what I'm doing wrong here?


Solution

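  • The reason is that group is an iterator, so it can be consumed only once: the first filter call exhausts it, and the later calls find nothing left to read. Converting the group to a list first lets you scan it as many times as you need. Here is the modified script that does the job: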

    import itertools
    import re
    
    data_file='Test.mgf'
    
    
    def isa_group_separator(line):
        return line == 'END IONS\n'
    
    
    regex_scan = re.compile(r'TITLE=')
    regex_precmass = re.compile(r'PEPMASS=')
    regex_charge = re.compile(r'CHARGE=')
    
    
    with open(data_file) as f:
        for (key, group) in itertools.groupby(f, isa_group_separator):
            if not key:
                # group is a one-shot iterator; keep its lines in a list for reuse
                g = list(group)
    
                precmass_match = filter(regex_precmass.search, g)
                print precmass_match
    
                scan_match = filter(regex_scan.search, g)
                print scan_match
    
                charge_match = filter(regex_charge.search, g)
                print charge_match
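
As an alternative to copying each group into a list, the three fields can be collected in a single pass over the group, which avoids the one-shot iterator problem altogether. The following is only a minimal sketch under some assumptions: the names SAMPLE, WANTED, is_group_separator and parse_blocks are hypothetical, the sample data is a shortened copy of the input from the question, and an in-memory io.StringIO stands in for the open file. It avoids print so that it runs unchanged on Python 2.7 and Python 3.

```python
import io
import itertools

# Shortened copy of the data from the question (assumption: the real file looks the same).
SAMPLE = u"""BEGIN IONS
TITLE=cmpd01_scan=23
RTINSECONDS=14.605
PEPMASS=694.299987792969 505975.375
CHARGE=2+
615.839727 1760.3752441406
END IONS
BEGIN IONS
TITLE=cmpd01_scan=24
RTINSECONDS=25.737
PEPMASS=694.299987792969 505975.375
CHARGE=2+
575.7636234 1891.1656494141
END IONS
"""

# Header keys to keep (hypothetical name, not part of the original script).
WANTED = ("TITLE", "PEPMASS", "CHARGE")


def is_group_separator(line):
    # strip() makes the test independent of the trailing newline
    return line.strip() == "END IONS"


def parse_blocks(lines):
    """Yield one dict of header fields per BEGIN IONS ... END IONS block."""
    for is_sep, group in itertools.groupby(lines, is_group_separator):
        if is_sep:
            continue
        record = {}
        for line in group:  # the group iterator is walked exactly once
            # split at the first "=" only, so "TITLE=cmpd01_scan=23" keeps its value intact
            key, sep, value = line.strip().partition("=")
            if sep and key in WANTED:
                record[key] = value
        if record:  # skip groups without header lines, e.g. trailing blank lines
            yield record


records = list(parse_blocks(io.StringIO(SAMPLE)))
```

With a real file, `parse_blocks(f)` can be called inside the same `with open(data_file) as f:` block. Note also that on Python 3 filter returns a lazy iterator rather than a list, so the script above would need `list(filter(...))` to print the matching lines.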