Suppose I have data in txt format, structured like this:
|-------------------------|
| ID 1 |
| Category A |
| Class ABC |
| Type 1 |
| Subcat A |
| positional entry 1 |
| positional entry 2 |
|-------------------------|
| ID 2 |
| Category B |
| Class ABC |
| Type 2 |
| Subcat B |
| positional entry 1 |
| positional entry 2 |
| positional entry 3 |
|-------------------------|
| ID 3 |
| Category A |
| Class E |
| Type 4 |
| Subcat A |
| positional entry 1 |
|-------------------------|
The data is stored in one large txt file (approx. 7 GB) with more than 100 million lines. I want to extract only those IDs that fall into Category A and Subcat A, but I also want their positional entries. The number of positional entries is not fixed; it is unknown how many such lines there are, so it can vary from ID to ID.
My first approach was to open the file as text and go through it line by line. The problem is that once the loop moves on to the next line, the information from the previous lines is, so to speak, lost, although I could set flags that are retained across lines (a sketch of this idea follows below).
My second approach was to first store the start and end position of every ID in a list, then check where an ID with Category A and Subcat A starts and to which range it belongs. However, I have many rows, so storing this information as ranges in lists with that many elements is not possible.
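For illustration, here is a minimal sketch of the first, line-by-line idea with the retained state made explicit: a small generator groups the stream into blocks, so only one block is ever held in memory. The file names input.txt and filtered.txt are placeholders, and the field positions follow the sample above:

def read_blocks(path, divider_prefix="|--"):
    """Yield each block as a list of raw lines (divider lines excluded)."""
    block = []
    with open(path) as f:
        for line in f:
            if line.startswith(divider_prefix):
                if block:
                    yield block
                block = []
            else:
                block.append(line)
    if block:                       # last block, if the file has no final divider
        yield block

def field(line):
    """Strip the surrounding pipes and whitespace from one data line."""
    return line.replace("|", "").strip()

DIVIDER = "|-------------------------|\n"

with open("filtered.txt", "w") as out:
    wrote_any = False
    for block in read_blocks("input.txt"):
        # positions as in the sample: block[1] is the Category line, block[4] the Subcat line
        if len(block) > 4 and field(block[1]) == "Category A" and field(block[4]) == "Subcat A":
            out.write(DIVIDER)      # divider in front of each kept block
            out.writelines(block)
            wrote_any = True
    if wrote_any:
        out.write(DIVIDER)          # one closing divider at the end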
Expected output:
|-------------------------|
| ID 1 |
| Category A |
| Class ABC |
| Type 1 |
| Subcat A |
| positional entry 1 |
| positional entry 2 |
|-------------------------|
| ID 3 |
| Category A |
| Class E |
| Type 4 |
| Subcat A |
| positional entry 1 |
|-------------------------|
How can I do this extraction?
Edit: "positional entry 1", "positional entry 2" and so on just mean that there can be a varying number of entries. These are lines with, for example, text entries that I need for later analysis.
Edit 2, regarding Zach Young's answer:
When I adapt the code to the following:
import csv

DIVIDER = "-------------------------"

f_in = open(r"C:\myfile\testfile.txt")

block: list[str] = []

with open(r"C:\myfile\output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for line in f_in:
        line = line.replace("|", "").strip()
        if line == DIVIDER:
            if len(block) > 4 and block[1] == "Category A" and block[4] == "Subcat A":
                print("check")
                print(block)
                writer.writerow(block)
            block = []
            continue
        block.append(line)
then I get the output as in the answer below. However, I do not want the data transposed; I would like to have the expected output exactly as I wrote it above.
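To keep the original block layout instead of CSV rows, one option (a sketch only, with placeholder paths, writing to a plain .txt instead of .csv) is to keep the raw lines of each block alongside the cleaned copies and write the raw block back, preceded by its divider, whenever the checks pass:

DIVIDER = "-------------------------"

with open(r"C:\myfile\testfile.txt") as f_in, \
        open(r"C:\myfile\output.txt", "w") as f_out:
    raw_block: list[str] = []   # original lines, pipes included
    block: list[str] = []       # cleaned copies, used only for the checks
    wrote_any = False
    for raw in f_in:
        line = raw.replace("|", "").strip()
        if line == DIVIDER:
            if len(block) > 4 and block[1] == "Category A" and block[4] == "Subcat A":
                f_out.write(f"|{DIVIDER}|\n")   # divider in front of each kept block
                f_out.writelines(raw_block)
                wrote_any = True
            raw_block = []
            block = []
            continue
        raw_block.append(raw)
        block.append(line)
    if wrote_any:
        f_out.write(f"|{DIVIDER}|\n")           # closing divider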
We can filter the data by reading only a minimal number of header lines, without having to read the entire block into memory.
from itertools import takewhile

# specify file parameters
header_height = 5
category_position = 1
subcat_position = 4
is_separator = lambda s: s[1] == '-'

with (
    open('input.file', 'r') as data,
    open('output.file', 'w') as out
):
    # find the first block separator
    for line in data:
        if is_separator(line):
            sep = line
            out.write(sep)
            break  # read the rest later
    else:
        raise ValueError('No starting point found')

    # define slicing to extract category and subcat values
    category_body = slice(len('| Category '), len(sep) - len('|\n'))
    subcat_body = slice(len('| Subcat '), len(sep) - len('|\n'))

    while True:
        # read the header or exit if the data is exhausted
        try:
            header = [next(data) for _ in range(header_height)]
        except StopIteration:
            break  # exit point

        # find the category and subcat values
        category = header[category_position][category_body].strip()
        subcat = header[subcat_position][subcat_body].strip()

        if category == 'A' and subcat == 'A':
            # copy data with suitable category and subcategory values
            out.writelines(header)
            out.writelines(takewhile(lambda line: line != sep, data))
            out.write(sep)
        else:
            # skip the lines up to the next block for the rest blocks
            while next(data) != sep: pass
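This way at most one five-line header is held in memory at a time: matching blocks are streamed straight to the output via takewhile, and non-matching blocks are skipped with next() until the next separator. Note that it relies on each header being exactly five lines and on every block ending with a separator line; the slice bounds also assume the header lines are padded to the same width as the separator, so the extracted value ends just before the closing "|".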