Search code examples
pythontxt

Extract data lines from txt for each ID if number of lines for each varies


Suppose I have data in txt format structed as this:

|-------------------------|
| ID 1                    |
| Category A              |
| Class ABC               |
| Type 1                  |
| Subcat A                |
| positional entry 1      |
| positional entry 2      |
|-------------------------|
| ID 2                    |
| Category B              |
| Class ABC               |
| Type 2                  |
| Subcat B                |
| positional entry 1      |
| positional entry 2      |
| positional entry 3      |
|-------------------------|
| ID 3                    |
| Category A              |
| Class E                 |
| Type 4                  |
| Subcat A                |
| positional entry 1      |
|-------------------------|

The data is stored in one large txt file (approx. 7 GB) with more than 100 million rows. I want to extract only those IDs that fall into Category A and Subcat A. However, I also want the positional entries. These are not fixed, it is unknown how many lines there are. So for each ID it can vary.

I tried to go through it with opening it as txt file and going through each line. My problem here is that each time the file pointer enters a new line the information so to say is lost, although I could try to set flags that are retained.

Second approach was to extract the beginning and end of each ID in a list first. Then check the position where an ID starts that has category A and Subcat A. However, I have many rows, so storing these information in ranges with lists that have so many elements is not possible. I wanted to check then for each ID to which range it falls.

Expected output:

|-------------------------|
| ID 1                    |
| Category A              |
| Class ABC               |
| Type 1                  |
| Subcat A                |
| positional entry 1      |
| positional entry 2      |
|-------------------------|
| ID 3                    |
| Category A              |
| Class E                 |
| Type 4                  |
| Subcat A                |
| positional entry 1      |
|-------------------------|

How can I do this extraction?

Edit: positional entry 1,2 and so on just means this can be some varying entries. So these are lines with for example text entries which I need for later analysis.

Edit 2 according to Zach Young's answer:

When I adopt the code to the following:

import csv
DIVIDER = "-------------------------"

f_in = open(r"C:\myfile\testfile.txt")
block: list[str] = []

with open(r"C:\myfile\output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for line in f_in:
        line = line.replace("|", "").strip()
        if line == DIVIDER:
            if len(block) > 4 and block[1] == "Category A" and block[4] == "Subcat A":
                print("check")
                print(block)
                writer.writerow(block)

            block = []
            continue

        block.append(line)

Then I get the ouptut as in the answer below, however I do not want transposing, I would like to have the expected output as I wrote it here.


Solution

  • We can filter the data by reading only a minimal number of header lines, without having to read the entire block into memory.

    from itertools import takewhile
    
    # specify file parameters
    header_height = 5
    category_position = 1
    subcat_position = 4
    is_separator = lambda s: s[1] == '-'
    
    with (
        open('input.file', 'r') as data, 
        open('output.file', 'w') as out
    ):
        # find the first block separator
        for line in data:
            if is_separator(line):
                sep = line       
                out.write(sep)
                break            # read the rest later
        else:
            raise ValueError('No starting point found')
        # define slicing to extract category and subcat values
        category_body = slice(len('| Category '), len(sep) - len('|\n'))
        subcat_body = slice(len('| Subcat '), len(sep) - len('|\n'))
        while True:
            # read the header or exit if the data is exhausted
            try:
                header = [next(data) for _ in range(header_height)]
            except StopIteration:
                break    # exit point
            # find the category and subcat values
            category = header[category_position][category_body].strip()
            subcat = header[subcat_position][subcat_body].strip()
            if category == 'A' and subcat == 'A':
                # сopy data with suitable category and subcategory values
                out.writelines(header)
                out.writelines(takewhile(lambda line: line != sep, data))
                out.write(sep)
            else:
                # skip the lines up to the next block for the rest blocks
                while next(data) != sep: pass