Tags: python, file, python-2.7, file-read

How to read a file block-wise in python


I am a bit stuck reading a file block-wise, and I am having difficulty extracting selected data from each block.

Here is my file content:

DATA.txt

#-----FILE-----STARTS-----HERE--#
#--COMMENTS CAN BE ADDED HERE--#

BLOCK IMPULSE DATE 01-JAN-2010 6 DEHDUESO203028DJE \
    SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=1021055:lr=1: \
    USERID=ID=291821 NO_USERS=3 GROUP=ONE id_info=1021055 \
    CREATION_DATE=27-JUNE-2013 SN=1021055  KEY ="22WS \
    DE34 43RE ED54 GT65 HY67 AQ12 ES23 54CD 87BG 98VC \
    4325 BG56"

BLOCK PASSION DATE 01-JAN-2010 6 DEHDUESO203028DJE \
    SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
    USERID=ID=291821 NO_USERS=1 GROUP=ONE id_info=324356 \
    CREATION_DATE=27-MAY-2012 SN=324356  KEY ="22WS \
    DE34 43RE 342E WSEW T54R HY67 TFRT 4ER4 WE23 XS21 \
    CD32 12QW"

BLOCK VICTOR DATE 01-JAN-2010 6 DEHDUESO203028DJE \
    SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
    USERID=ID=291821 NO_USERS=5 GROUP=ONE id_info=324356 \
    CREATION_DATE=27-MAY-2012 SN=324356  KEY ="22WS \
    DE34 43RE 342E WSEW T54R HY67 TFRT 4ER4 WE23 XS21 \
    CD32 12QW"

#--BLOCK--ENDS--HERE#
#--NEW--BLOCKS--CAN--BE--APPENDED--HERE--#      

I am only interested in the block name, NO_USERS, and id_info of each block. These three values should be saved to a data structure (let's say a dict), which is then stored in a list:

[{Name: IMPULSE, NO_USERS: 3, id_info: 1021055}, {Name: PASSION, NO_USERS: 1, id_info: 324356}, ...]

Any other data structure that can hold this information would also be fine.

So far I have tried getting the block names by reading the file line by line:

unique = []
with open('DATA.txt') as fOpen:
    for row in fOpen:
        # startswith avoids matching comment lines such as "#--BLOCK--ENDS--HERE#"
        if row.startswith("BLOCK"):
            unique.append(row.split()[1])
print(unique)

I am thinking of a regular-expression approach, but I have no idea where to start. Any help would be appreciated. Meanwhile, I am still trying and will update if I get something.


Solution

  • You could use groupby to find each block, then use a regex to extract the info and put the values in dicts:

    from itertools import groupby
    import re
    
    
    with open("test.txt") as f:
        data = []
        # find NO_USERS= followed by 1+ digits, or id_info= followed by 1+ digits
        r = re.compile(r"NO_USERS=\d+|id_info=\d+")
        grps = groupby(f,key=lambda x:x.strip().startswith("BLOCK"))
        for k,v in grps:
            # if k is True we have a block line
            if k:
                # get name after BLOCK
                name = next(v).split(None,2)[1]
                # get lines after BLOCK and get the second of those
                t = next(grps)[1]
                # we want two lines after BLOCK
                _, l = next(t), next(t)
                d = dict(s.split("=") for s in r.findall(l))
                # add name to dict
                d["Name"] = name
                # add dict to data list
                data.append(d)
    
    print(data)
    

    Output:

    [{'NO_USERS': '3', 'id_info': '1021055', 'Name': 'IMPULSE'},
     {'NO_USERS': '1', 'id_info': '324356', 'Name': 'PASSION'},
     {'NO_USERS': '5', 'id_info': '324356', 'Name': 'VICTOR'}]
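
To see why calling next(grps) inside the loop returns the lines that follow a BLOCK line, it can help to look at the runs that groupby produces. This is a minimal sketch, assuming a shortened in-memory stand-in for the file; the `lines` list and the `is_block` helper are illustrative names and not part of the answer above:

```python
from itertools import groupby

# Shortened stand-in for the file; only the lines needed for the demo.
lines = [
    "#--COMMENTS CAN BE ADDED HERE--#\n",
    "\n",
    "BLOCK IMPULSE DATE 01-JAN-2010 \\\n",
    "    SEQUENCE=ai=0: \\\n",
    "    USERID=ID=291821 NO_USERS=3 GROUP=ONE id_info=1021055 \\\n",
    "\n",
    "BLOCK PASSION DATE 01-JAN-2010 \\\n",
    "    SEQUENCE=ai=0: \\\n",
    "    USERID=ID=291821 NO_USERS=1 GROUP=ONE id_info=324356 \\\n",
]


def is_block(line):
    """Same test as the lambda used as the groupby key."""
    return line.strip().startswith("BLOCK")


# groupby yields (key, run) pairs for consecutive lines sharing the same key,
# so each True run (one BLOCK line) is followed by a False run (its body).
runs = [(key, len(list(run))) for key, run in groupby(lines, key=is_block)]
print(runs)
# [(False, 2), (True, 1), (False, 3), (True, 1), (False, 2)]
```

Each True run holds exactly one BLOCK line, and the False run right after it holds that block's remaining lines, which is what next(grps)[1] picks up.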
    

    Or, without groupby: since your file follows a fixed format, we just need to extract the second line after each BLOCK line:

    import re
    
    with open("test.txt") as f:
        data = []
        r = re.compile(r"NO_USERS=\d+|id_info=\d+")
        for line in f:
            # if True we have a new block
            if line.startswith("BLOCK"):
                # call next twice to get the second line after BLOCK
                _, l = next(f), next(f)
                # get name after BLOCK
                name = line.split(None,2)[1]
                # find our substrings from l 
                d = dict(s.split("=") for s in r.findall(l))
                d["Name"] = name
                data.append(d)
    
    print(data)
    

    Output:

    [{'NO_USERS': '3', 'id_info': '1021055', 'Name': 'IMPULSE'}, 
    {'NO_USERS': '1', 'id_info': '324356', 'Name': 'PASSION'}, 
    {'NO_USERS': '5', 'id_info': '324356', 'Name': 'VICTOR'}]
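
Both versions above rely on NO_USERS and id_info always sitting on the second line after BLOCK. If that cannot be guaranteed, one possible variation (a sketch, not part of the original answer) is to join the backslash-continued lines into one logical line per block and search that line. `parse_blocks` and `SAMPLE` are hypothetical names, and the sample is a shortened copy of the question's data:

```python
import re

# Shortened copy of the question's data (raw string keeps the backslashes).
SAMPLE = r'''#-----FILE-----STARTS-----HERE--#

BLOCK IMPULSE DATE 01-JAN-2010 6 DEHDUESO203028DJE \
    SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=1021055:lr=1: \
    USERID=ID=291821 NO_USERS=3 GROUP=ONE id_info=1021055 \
    CREATION_DATE=27-JUNE-2013 SN=1021055  KEY ="22WS \
    4325 BG56"

BLOCK PASSION DATE 01-JAN-2010 6 DEHDUESO203028DJE \
    SEQUENCE=ai=0:at=221:ae=3:lu=100:lo=NNU:ei=324356:lr=1: \
    USERID=ID=291821 NO_USERS=1 GROUP=ONE id_info=324356 \
    CREATION_DATE=27-MAY-2012 SN=324356  KEY ="22WS \
    CD32 12QW"

#--BLOCK--ENDS--HERE#
'''


def parse_blocks(text):
    """Return one dict per BLOCK, regardless of which line holds each field."""
    # A trailing backslash continues the logical line, so replace
    # "backslash + optional spaces + newline" with a single space.
    joined = re.sub(r"\\[ \t]*\n", " ", text)
    blocks = []
    for line in joined.splitlines():
        if not line.startswith("BLOCK"):
            continue  # skip comments and blank lines
        users = re.search(r"\bNO_USERS=(\d+)", line)
        id_info = re.search(r"\bid_info=(\d+)", line)
        blocks.append({
            "Name": line.split(None, 2)[1],
            "NO_USERS": users.group(1) if users else None,
            "id_info": id_info.group(1) if id_info else None,
        })
    return blocks


print(parse_blocks(SAMPLE))
```

With a real file, the same function could be called as parse_blocks(f.read()); a missing field shows up as None instead of raising an error.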
    

    To extract values you can iterate:

    for dct in data:
        print(dct["NO_USERS"])
    

    Output:

    3
    1
    5
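
Note that the regex returns strings, so the stored values are '3', '1', '5' rather than numbers. If numeric values are needed, a small conversion step could look like the following sketch; the `data` list here is a hand-written copy of the result shown above, and `total_users` is an illustrative name:

```python
# Hand-written copy of the list built by the code above (values are strings).
data = [
    {"NO_USERS": "3", "id_info": "1021055", "Name": "IMPULSE"},
    {"NO_USERS": "1", "id_info": "324356", "Name": "PASSION"},
    {"NO_USERS": "5", "id_info": "324356", "Name": "VICTOR"},
]

# Convert the two numeric fields in place, leaving the block name untouched.
for dct in data:
    for key in ("NO_USERS", "id_info"):
        dct[key] = int(dct[key])

total_users = sum(dct["NO_USERS"] for dct in data)
print(total_users)  # 9
```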
    

    If you want a dict of dicts so that you can access each section by number from 1 to n, you can store the blocks as nested dicts using 1 to n as the keys:

    from itertools import count
    import re
    
    with open("test.txt") as f:
        data, cn = {}, count(1)
        r = re.compile(r"NO_USERS=\d+|id_info=\d+")
        for line in f:
            if line.startswith("BLOCK"):
                _, l = next(f), next(f)
                name = line.split(None,2)[1]
                d = dict(s.split("=") for s in r.findall(l))
                d["Name"] = name
                data[next(cn)] = d
    # next(cn) is now one past the last key used, so subtract 1
    data["num_blocks"] = next(cn) - 1
    

    Output:

    from pprint import pprint as pp
    
    pp(data)
    {1: {'NO_USERS': '3', 'Name': 'IMPULSE', 'id_info': '1021055'},
     2: {'NO_USERS': '1', 'Name': 'PASSION', 'id_info': '324356'},
     3: {'NO_USERS': '5', 'Name': 'VICTOR', 'id_info': '324356'},
     'num_blocks': 3}
    

    'num_blocks' will tell you exactly how many blocks you extracted.
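
Storing 'num_blocks' in the same dict mixes a string key with the integer keys, so code that loops over the dict has to skip it. As an alternative sketch (not the approach of the answer above), the blocks could be collected in a list first and numbered with enumerate, with len() giving the count. The `blocks` list is a hand-written sample and `numbered` is an illustrative name:

```python
# Hand-written sample of the per-block dicts, in file order.
blocks = [
    {"Name": "IMPULSE", "NO_USERS": "3", "id_info": "1021055"},
    {"Name": "PASSION", "NO_USERS": "1", "id_info": "324356"},
    {"Name": "VICTOR", "NO_USERS": "5", "id_info": "324356"},
]

# enumerate(..., 1) numbers the blocks from 1, so no separate counter is needed.
numbered = dict(enumerate(blocks, 1))

print(len(numbered))        # 3
print(numbered[2]["Name"])  # PASSION
```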