python, web-scraping, scrapy

Scrapy does not capture all of the markup that comes back in the response


I'm trying to capture the HTML markup that comes back in the HTTP responses, but I only get part of it; for some reason it cuts off in the middle. What could be causing this? Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging


class StackOverflowSpider(scrapy.Spider):
    
    name = 'stackoverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions/tagged/python?tab=newest&page=1&pagesize=15']
    first_request_done = False
    
    def start_requests(self):
        if not self.first_request_done:
            self.first_request_done = True
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
            
    def parse(self, response):
        if response.status == 200 and response.headers.get('Content-Type', b'').startswith(b'text/html'):
            html = response.body.decode('utf-8')
            print(html)
        
        yield
    

configure_logging()
process = CrawlerProcess(settings={
    'LOG_ENABLED': False,
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1
})
process.crawl(StackOverflowSpider)
process.start(stop_after_crawl=False)

Solution

  • This is just the Python print function not properly flushing its output. This can be demonstrated by splitting the page content into lines and printing them out one at a time, or alternatively by writing the contents to a file and viewing the full output there.

    For example, you can try this to print it out line by line:

    def parse(self, response):
        for line in response.text.splitlines():
            print(line)
    

    or if you wanted to write the contents to a file:

    def parse(self, response):
        with open('response.html', "wt", encoding="utf8") as htmlfile:
            htmlfile.write(response.text)
        ...
        ...
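
    If you would rather keep printing the whole page in one go, forcing the output buffer to flush may also help. This is a minimal sketch, not part of the original answer, assuming the same parse callback:

    def parse(self, response):
        # flush=True forces stdout to be flushed immediately, so the tail
        # of a large page is not left sitting in the output buffer
        print(response.text, flush=True)

    Writing the response to a file, as shown above, remains the most reliable way to inspect the full markup.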