So, I'm a very amateur Python programmer, but I hope everything I explain makes sense.
I want to scrape a type of financial document called a "10-K". I'm only interested in a small part of the whole document. An example of the URL I'm trying to scrape is: https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt
Now, if I download this document as a .txt, it "only" weighs 12 MB, so in my ignorance it doesn't make much sense that it takes 1-2 minutes to .read() (even though I have a decent PC).
The original code I was using:
from urllib.request import urlopen
url = 'https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt'
response = urlopen(url)
document = response.read()
After this I was basically splitting the whole document into <DOCUMENT>data</DOCUMENT> portions, and using a for loop to check whether each portion contained certain keywords like <strong>CONSOLIDATED BALANCE SHEETS, which told me it held the table I wanted to scrape. All of this was done in a fairly basic way (I can share the code if needed), because I've tried bs4 and other parsers and they were a PITA at my level. The table in the matching document was then parsed with pd.read_html().
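Roughly, the old code looked like this (just a reconstruction to show the idea, not the exact code; the variable names are illustrative):
import pandas as pd

# Split the filing into <DOCUMENT> ... </DOCUMENT> portions
sections = document.split(b'<DOCUMENT>')
for section in sections:
    # Keep the portion that mentions the keyword and let pandas parse its tables
    if b'<strong>CONSOLIDATED BALANCE SHEETS' in section:
        tables = pd.read_html(section.decode('utf-8', errors='replace'))
        break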
So now my approach is this:
import requests

KeyWord = b'<strong>CONSOLIDATED BALANCE SHEETS'
interesting_chunk = b''
document = requests.get(url, stream=True)
for chunk in document.iter_content(10000):
    if KeyWord in chunk:
        interesting_chunk = chunk
    else:
        continue
And after this, I search for the start and the end of the <DOCUMENT>
doc_start_pos = interesting_chunk.find(b'<DOCUMENT>')
doc_end_pos = interesting_chunk.find(b'</DOCUMENT>', doc_start_pos)
final_document = interesting_chunk[doc_start_pos:doc_end_pos]
Problems here: the <DOCUMENT> start and end tags may not both be inside the chunk, or may not appear in it at all. So I've thought about using another variable to save the previous chunk in the loop, so that if I find KeyWord I can still concatenate the previous and current chunks and find the <DOCUMENT> start, and for the end I could keep iterating until the next </DOCUMENT>.
But I don't know how to handle the problem of a split KeyWord. It's random, it's a large file, and it's unlikely, but if I use small chunks it becomes more likely. How do I avoid KeyWord being split between two chunks?
Also, I don't know what the optimal chunk size should be...
The time it takes to read a document over the internet is really not related to the speed of your computer, at least in most cases. The most important determinant is the speed of your internet connection. Another important determinant is the speed with which the remote server responds to your request, which will depend in part on how many other requests the remote server is currently trying to handle.
It's also possible that the slow-down is not due to either of the above causes, but rather to measures taken by the remote server to limit scraping or to avoid congestion. It's very common for servers to deliberately reduce responsiveness to clients which make frequent requests, or even to deny the requests entirely. Or to reduce the speed of data transmission to everyone, which is another way of controlling server load. In that case, there's not much you're going to be able to do to speed up reading the requests.
From my machine, it takes a bit under 30 seconds to download the 12MB document. Since I'm in Perú it's possible that the speed of the internet connection is a factor, but I suspect that it's not the only issue. However, the data transmission does start reasonably quickly.
If the problem were related to the speed of data transfer between your machine and the server, you could speed things up by using a streaming parser (a phrase you can search for). A streaming parser reads its input in small chunks and assembles them on the fly into tokens, which is basically what you are trying to do. But the streaming parser will deal transparently with the most difficult part, which is to avoid tokens being split between two chunks. However, the nature of the SEC document, which taken as a whole is not very pure HTML, might make it difficult to use standard tools.
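Just to illustrate the general shape of that idea (not a recommendation for this particular file, and assuming it decodes as UTF-8; the class name is only illustrative), the standard library's html.parser accepts its input incrementally through feed() and handles tags split between chunks by itself:
from html.parser import HTMLParser
from urllib.request import urlopen

class StrongCollector(HTMLParser):
    # Collects the text inside <strong> tags as the stream is fed in
    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == 'strong':
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.headings.append(data)

parser = StrongCollector()
with urlopen(url) as f:
    while True:
        chunk = f.read(65536)
        if not chunk:
            break
        # feed() takes care of tokens split between chunks
        parser.feed(chunk.decode('utf-8', errors='replace'))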
Since the part of the document you want to analyse is well past the middle, at least in the example you presented, you won't be able to reduce the download time by much. But that might still be worthwhile.
The basic approach you describe is workable, but you'll need to change it a bit in order to cope with the search strings being split between chunks, as you noted. The basic idea is to append successive chunks until you find the string, rather than just looking at them one at a time.
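As an aside, the narrower problem of a search string being split between two chunks can also be handled without accumulating everything, by carrying a short overlap forward. A minimal sketch, assuming the requests-based loop from the question (the function name is just illustrative):
import requests

def url_contains(url, keyword, chunk_size=65536):
    # Keep the last len(keyword) - 1 bytes seen so far and search them
    # together with the new chunk, so a keyword that straddles a chunk
    # boundary is still found
    tail = b''
    with requests.get(url, stream=True) as response:
        for chunk in response.iter_content(chunk_size):
            if keyword in tail + chunk:
                return True
            tail = (tail + chunk)[-(len(keyword) - 1):]
    return False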
I'd suggest first identifying the entire document and then deciding whether it's the document you want. That reduces the search issue to a single string, the document terminator (\n</DOCUMENT>\n; the newlines are added to reduce the possibility of false matches).
Here's a very crude implementation, which I suggest you take as an example rather than just copying it into your program. The function docs yields successive complete documents from a URL; the caller can use that to select the one they want. (In the sample code, the first matching document is used, although there are actually two matches in the complete file. If you want all matches, then you will have to read the entire input, in which case you won't have any speed-up at all, although you might still have some savings from not having to parse everything.)
from urllib.request import urlopen

def docs(url):
    with urlopen(url) as f:
        buff = b''
        fence = b'\n</DOCUMENT>\n'
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            # Back the search up slightly so a fence split between
            # two chunks is still found
            start = max(0, len(buff) - len(fence))
            buff += chunk
            end = buff.find(fence, start)
            if end != -1:
                end += len(fence)
                yield buff[buff.find(b'<DOCUMENT>'):end]
                buff = buff[end:]
url = 'https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt'
keyword = b'<strong>CONSOLIDATED BALANCE SHEETS'

for document in docs(url):
    if keyword in document:
        # Process document
        break
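From there the matched document could be handed to pandas, much as the question already does; a minimal sketch, assuming the document decodes as UTF-8 and that the balance sheet is among the tables read_html returns:
import pandas as pd

# read_html returns a list of DataFrames, one per <table> in the document;
# the balance sheet has to be picked out of that list
tables = pd.read_html(document.decode('utf-8', errors='replace'))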