How can I group the lines in a text file based on the contents of each line?


Assume I have a text file containing the following:

12277   17/06/2019  350 BJ201AB FMACRI                  
                    0   J   52  4081.15 166851
                    0   J   52  4496.64 166852
                    0   J   52  5139.07 166855
                    0   J   52  5773.82 166858
                    J   E   70  25  B159681
12509   21/06/2019  443 DH717WF BLANCO                  
                    B   J   42  5376.63 5164/A
12504   21/06/2019  443 EB631NF LUCCIG                  
                    B   J   44  5567.46 5165/A
                    0   J   52  5347.58 166950
                    0   J   52  4742.4  166953
                    0   J   18  1146.24 427876
                    0   J   4   0.4 427877
                    J   0   372 1   B159763
                    R   0   1567    1   B159764

Assuming I would read the file like this:

with open('/home/pexp1/mezzi/INPUT') as f:
    lines = f.readlines()
data = [(line.rstrip()).split('\t') for line in lines]

What is the right way to group each line that starts with a value (an int, a string, etc.) with the lines underneath it, up to the next line that starts with a value? And if I want to look up such a line and get everything in its group, what data structure would be best for holding these groups?

EDIT: Apologies for the lack of clarity. Running the code above and then print(data) gives:

[
    ['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI'],
    ['', '', '', '', '', '0', 'J', '52', '4081.15', '166851'],
    ['', '', '', '', '', '0', 'J', '52', '4496.64', '166852'],
    ['', '', '', '', '', '0', 'J', '52', '5139.07', '166855'],
    ['', '', '', '', '', '0', 'J', '52', '5773.82', '166858'],
    ['', '', '', '', '', 'J', 'E', '70', '25', 'B159681'],
    ['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO'],
    ['', '', '', '', '', 'B', 'J', '42', '5376.63', '5164/A'],
    ['12504', '21/06/2019', '443', 'EB631NF', 'LUCCIG'],
    ['', '', '', '', '', 'B', 'J', '44', '5567.46', '5165/A'],
    ...
]

As you can see, it's a list of lists. How can I group these items so that each list with an item at index 0 (in this case 12277, 12509, etc.) is grouped with the lists that follow below it (those with no elements at index positions 0 to 4)?

Example:

['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI']

grouped with ['', '', '', '', '', '0', 'J', '52', '4081.15', '166851'], ['', '', '', '', '', '0', 'J', '52', '4496.64', '166852'], etc. up until the next line containing an element at index 0: ['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO']

EDIT2: This is the solution I came up with:

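shipments = []    # child lines collected for the current header
shuttle_lst = []  # finished groups, each ending with its header
header = None     # header line of the group being built

for line in data:
    if line[0]:  # non-empty first field: a new header starts here
        if header is not None:
            shuttle_lst.append(shipments + [header])
        header, shipments = line, []
    else:
        shipments.append(line)

# Flush the final group, which has no following header to trigger it
if header is not None:
    shuttle_lst.append(shipments + [header])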

This creates a list of lists where each header becomes the last element of that list.


Solution

  • If I understand correctly, you want to group the lines under their header line, which is the one that does not start with whitespace, right?

    Consider the following:

    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    
    # A list of lists
    data = []
    
    with open('data.dat') as f:
        for line in f:
            if line.startswith(" ") or line.startswith("\t"):
                if not data:
                    raise RuntimeError("Wrong data - first line is not legit")
                data[-1].append(line.split())
                continue
    
            # If here, this is a header line
            data.append([line.split()])
    
    pp.pprint(data)
    

    This prints:

    [   [   ['12277', '17/06/2019', '350', 'BJ201AB', 'FMACRI'],
            ['0', 'J', '52', '4081.15', '166851'],
            ['0', 'J', '52', '4496.64', '166852'],
            ['0', 'J', '52', '5139.07', '166855'],
            ['0', 'J', '52', '5773.82', '166858'],
            ['J', 'E', '70', '25', 'B159681']],
        [   ['12509', '21/06/2019', '443', 'DH717WF', 'BLANCO'],
            ['B', 'J', '42', '5376.63', '5164/A']],
        [   ['12504', '21/06/2019', '443', 'EB631NF', 'LUCCIG'],
            ['B', 'J', '44', '5567.46', '5165/A'],
            ['0', 'J', '52', '5347.58', '166950'],
            ['0', 'J', '52', '4742.4', '166953'],
            ['0', 'J', '18', '1146.24', '427876'],
            ['0', 'J', '4', '0.4', '427877'],
            ['J', '0', '372', '1', 'B159763'],
            ['R', '0', '1567', '1', 'B159764']]]
    

    The result is a list of lists (of lists!). The first item of each second-level list is the header line, and the remaining items are the lines in that group.
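
    If you also want to look a group up by its header, one option is a dict keyed by the header's first field. The sketch below is only an illustration: the function name `group_by_header`, the "header"/"rows" keys, and the in-memory `sample` are made up for this example, and it assumes the first field of each header line is unique (a repeated key would overwrite the earlier group).

```python
import io


def group_by_header(lines):
    """Map each header's first field to its header fields and child rows."""
    groups = {}
    current = None  # list of child rows for the group being filled
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if line[0].isspace():
            # Indented line: belongs to the most recent header
            if current is None:
                raise ValueError("child line found before any header line")
            current.append(fields)
        else:
            # Header line: start a new group (assumes fields[0] is unique)
            current = []
            groups[fields[0]] = {"header": fields, "rows": current}
    return groups


# Hypothetical stand-in for the real file, using a few lines from the question
sample = io.StringIO(
    "12277\t17/06/2019\t350\tBJ201AB\tFMACRI\n"
    "\t\t\t\t\t0\tJ\t52\t4081.15\t166851\n"
    "\t\t\t\t\tJ\tE\t70\t25\tB159681\n"
    "12509\t21/06/2019\t443\tDH717WF\tBLANCO\n"
    "\t\t\t\t\tB\tJ\t42\t5376.63\t5164/A\n"
)
groups = group_by_header(sample)
print(groups["12277"]["rows"])
```

    With a real file you would pass the open file object instead of `sample`. A plain dict keeps insertion order, so iterating over `groups` still visits the headers in file order.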