Scrapy error on redirects to PDF file: AttributeError: Response content isn't text


I've got a Scrapy spider hosted on Zyte using Smart Proxies.

My spider is fairly simple: it starts crawling from a list of URLs.

The parse method uses a simple LinkExtractor to extract links on the domain and then crawls those links.

Simplified parse method:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    internal_le = LinkExtractor(
        allow_domains=tld_t,  # try to stay on domain (this is a tldextract of response.url)
        unique=True,  # de-dup
        # deny_extensions=self.deny_extensions
    )
    in_links = internal_le.extract_links(response)

    for link in in_links:
        if link.url:
            yield Request(
                link.url,
                callback=self.parse,
            )

Because deny_extensions defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes pdf, I assumed it would not crawl a PDF link. However, I have internal links that redirect to externally hosted PDF files.

Here are some extracts from logs with examples:

2023-11-27 23:41:01 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf> (referer: https://west.usd262.net/about)
2023-11-27 23:41:02 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx> (referer: https://west.usd262.net/about)
2023-11-27 23:41:05 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1676649887/usd262net/adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf> (referer: https://vcms.usd262.net/about)
2023-11-27 23:41:10 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073617/usd262net/zjuysts6fymaf5gjumlc/VCMSStudentHandbook23-24Finaldocx.pdf> (referer: https://vcms.usd262.net/about)

And here is a single trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
    yield next(it)
          ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/tmp/unpacked-eggs/__main__.egg/edtech/spiders/edcrawler.py", line 117, in parse
    ex_links = external_le.extract_links(response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/response.py", line 26, in get_base_url
    text = response.text[0:4096]
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/http/response/__init__.py", line 137, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text

I've tried various approaches to change my link extractor, but presumably the link looks fine to the link extractor. It's the redirect that points to the PDF file, which gets downloaded and produces the error.

Example chain: the start URL, an internal link on that page that is extracted into 'in_links', and a redirect from that link to a PDF document on the web host.

The only fix I can think of is a custom middleware that replaces the redirect middleware and looks for r"\.pdf$" in the request.url.

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect

Am I missing something? I'm using the latest Scrapy, 2.11.0. I've also logged an issue on the Scrapy GitHub (github/6159).


Solution

  • I think your best option in this situation is to subclass RedirectMiddleware and add a few lines that check the Location header of the initial response for a .pdf (or .docx) extension, raising the IgnoreRequest exception if one is found.

    This can all be done in just a handful of lines.

    Example:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
    from scrapy.exceptions import IgnoreRequest


    class PDFRedirect(RedirectMiddleware):
        """Redirect middleware that drops redirects pointing at PDF/DOCX files."""

        def process_response(self, request, response, spider):
            # Location is only set on redirect responses; default to an empty string.
            location = response.headers.get("Location", b"").decode()
            if location.lower().endswith((".pdf", ".docx")):
                print(f"IGNORING PDF {location}")
                # Drop the request before the redirect is followed.
                raise IgnoreRequest(f"redirect to document file: {location}")
            return super().process_response(request, response, spider)


    class PdfRedirectSpider(scrapy.Spider):
        name = 'nopdfs'
        allowed_domains = ['west.usd262.net']
        start_urls = ['https://west.usd262.net/about']

        custom_settings = {
            # Swap the built-in redirect middleware for the subclass
            # (600 is the built-in middleware's default priority).
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
                PDFRedirect: 600,
            }
        }

        def parse(self, response):
            internal_le = LinkExtractor(unique=True)
            in_links = internal_le.extract_links(response)
            for link in in_links:
                if link.url:
                    yield scrapy.Request(link.url, callback=self.parse)
    

    OUTPUT

    2023-11-30 15:00:35 [scrapy.core.engine] INFO: Spider opened
    2023-11-30 15:00:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2023-11-30 15:00:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2023-11-30 15:00:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about> (referer: None)
    2023-11-30 15:00:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://west.usd262.net/about> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.usd262.net': <GET https://www.usd262.net/staff-links1>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'abilene.usd262.net': <GET https://abilene.usd262.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'wheatland.usd262.net': <GET https://wheatland.usd262.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcis.usd262.net': <GET https://vcis.usd262.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcms.usd262.net': <GET https://vcms.usd262.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vchs.usd262.net': <GET https://vchs.usd262.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tlc.usd262.net': <GET https://tlc.usd262.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/profile.php?id=100061273524317>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/USD262>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.youtube.com': <GET https://www.youtube.com/channel/UCD8AdyKpM44gpFzqIqBG9tw>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-22-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-22-us-central1-01.preview.finalsitecdn.com/about/calendar1>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.finalsite.com': <GET https://www.finalsite.com>
    2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about#fsPageContent> (referer: https://west.usd262.net/about)
    IGNORING PDF https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx
    2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/privacy-policy> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/> (referer: https://west.usd262.net/about)
    IGNORING PDF https://resources.finalsite.net/images/v1686234716/usd262net/hdkhsv6qg1jzbobmkrxs/23-24elementaryschoolsupplylist8511in.pdf
    2023-11-30 15:00:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/contact645-clone> from <GET https://west.usd262.net/fs/pages/3813>
    2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/accessibility-statement> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.valleycenterhornets.net': <GET https://www.valleycenterhornets.net>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sideline.bsnsports.com': <GET https://sideline.bsnsports.com/schools/kansas/valleycenter/valley-center-high-school/design/picker>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-34-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-34-us-central1-01.preview.finalsitecdn.com/about>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'calendar.google.com': <GET https://calendar.google.com/calendar/embed?src=usd262.net_b07qmrijq7dq09a7s93u4qq7u0%40group.calendar.google.com&ctz=America%2FChicago>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'datacentral.ksde.org': <GET https://datacentral.ksde.org/accountability.aspx>
    IGNORING PDF https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.w3.org': <GET http://www.w3.org/TR/WCAG/>
    2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'accessibilitystatementgenerator.com': <GET http://accessibilitystatementgenerator.com>
    2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/parent756> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/pto> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/site-map> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/footer-links> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.infinitecampus.org': <GET https://usd262.infinitecampus.org/campus/portal/valleycenter.jsp>
    2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net.finalsite.com': <GET https://usd262net.finalsite.com/fs/resource-manager/view/383a8f18-5ef9-4f48-815e-030300759293>
    2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'docs.google.com': <GET https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRi840waukqIIVzL9eM4X9EoxwIsGKyuwsu83A852Mv6dMnPmjQSF0HKFRrMmpw1g/pubhtml>
    2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.incidentiq.com': <GET https://usd262.incidentiq.com/>
    2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'educatekansas.org': <GET https://educatekansas.org/>
    2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/volunteering> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/ymca-childcare> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'ymcawichita.org': <GET https://ymcawichita.org/programs/child-care-and-camps/before-and-after-school>
    2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/emergency-safety-interventions-bullying> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/librarymedia-center> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/volunteer-information> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'search.follettsoftware.com': <GET https://search.follettsoftware.com/metasearch/ui/43691>
    2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bookfairs.scholastic.com': <GET https://bookfairs.scholastic.com/bf/westelementaryschool11>
    2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.commonsensemedia.org': <GET https://www.commonsensemedia.org/>
    2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/news> from <GET https://west.usd262.net/fs/pages/3814>
    IGNORING PDF https://resources.finalsite.net/images/v1680193574/usd262net/skenieqeiwealjrpl210/33023ActivationInstructionforCampusPortal3.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673804004/usd262net/i0mi93dw4rp63jsem0jt/PTOMeetingMinutes1220docx.pdf
    2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/contact645-clone> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/sraff-directory> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/schools> from <GET https://west.usd262.net/fs/pages/2799>
    IGNORING PDF https://resources.finalsite.net/images/v1673803989/usd262net/bvokssior5jikny5ggwk/PTOMeetingMinutes2120docx.pdf
    2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/report-bullying-safety-concerns> from <GET https://west.usd262.net/fs/pages/3560>
    2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/counseling> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.ksde.org': <GET http://www.ksde.org/Default.aspx?tabid=149>
    2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.homeworkkansas.org': <GET http://www.homeworkkansas.org/>
    IGNORING PDF https://resources.finalsite.net/images/v1673803943/usd262net/okuntylyovx2hn260gmt/PTOMeetingMinutes1919docx.pdf
    2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/nurses-page> (referer: https://west.usd262.net/about)
    2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kidshealth.org': <GET http://www.kidshealth.org/parent/firstaid_safe/>
    IGNORING PDF https://resources.finalsite.net/images/v1673803972/usd262net/s8sipel9qrbd1kwqrklg/FebPTOMeetingMinutes1120docx.pdf
    2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/document-library> (referer: https://west.usd262.net/about)
    IGNORING PDF https://resources.finalsite.net/images/v1673803909/usd262net/zcygtqo4nk94alxapei2/PTOMeetingMinutes1719.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803957/usd262net/kpvyrmpdxbbwic1o9mkw/1-21-20PTOMeetingMinutes21201docx.pdf
    2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/administration> (referer: https://west.usd262.net/about)
    IGNORING PDF https://resources.finalsite.net/images/v1673803928/usd262net/aprcr3g9v0x76agcz81m/PTOMeetingMinutes2219docx.pdf
    2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://west.usd262.net/about/sraff-directory> from <GET https://west.usd262.net/staff-directory>
    2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://west.usd262.net> from <GET https://west.usd262.net/fs/resource-manager/view/446cdd83-e743-495f-b0f1-91318deef052>
    IGNORING PDF https://resources.finalsite.net/images/v1673803888/usd262net/sun8frlao9rk4gftotnp/PTOMeetingMinutes2719.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803867/usd262net/cev2livmjpacfgyq0qrc/4202021PTOMeetingminutes.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803137/usd262net/km3nodsbggl5taziszk3/MicrosoftWord-TotallyCoolElementarySchool_1.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803819/usd262net/k5xboy8whfnymanvvuyk/MeetingminutesFeb.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803121/usd262net/u5ctbelnubnhgz9gw6wa/WestElementaryCounselingBrochurefinal-2008_1.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673784917/usd262net/lzwphtnhcoqjp9thds6n/FactSheet-TitleI-ParentInvolvement.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803778/usd262net/rwb5tlbdaap8e1wjiizl/NovemberPTOMeetingMinutes.pdf
    2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/enrollment/student-health-information> from <GET https://west.usd262.net/fs/pages/3541>
    IGNORING PDF https://resources.finalsite.net/images/v1673784914/usd262net/dkc6smzfpcylihjyl0mx/ESIBoardPolicies-19.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803487/usd262net/eyojl1bd1qdj3lp8bjki/RICE-RestIceCompresionElevation_1.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673784913/usd262net/mlrg9xwsotm3a6ccmazy/ESI-DocumentsforWebsite-19.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803799/usd262net/rczvldr6kah713hisfwx/JanuaryPTOMeetingMinutes.pdf
    2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/report-bullying-safety-concerns> (referer: https://west.usd262.net/about/emergency-safety-interventions-bullying)
    IGNORING PDF https://resources.finalsite.net/images/v1673784915/usd262net/u4efohzm82jnbzzsqxd3/FERPANotificationofRights.pdf
    2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.p3tips.com': <GET https://www.p3tips.com/tipform.aspx?ID=217>
    2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.crisistextline.org': <GET https://www.crisistextline.org/texting-in/>
    2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kbi.ks.gov': <GET https://www.kbi.ks.gov/sar>
    2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.onlinesafetyhub.io': <GET https://usd262.onlinesafetyhub.io/>
    IGNORING PDF https://resources.finalsite.net/images/v1673803764/usd262net/yswpmxj1ivn5dr4onfue/OctoberPTOmeeting.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803749/usd262net/cfcylzqvhzvvsacorltx/SeptemberPTOmeetingnotes.pdf
    2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/news> (referer: https://west.usd262.net/)
    2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://west.usd262.net/fs/pages/3508> (referer: https://west.usd262.net/about/document-library)
    2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/schools> (referer: https://west.usd262.net/)
    IGNORING PDF https://resources.finalsite.net/images/v1673803706/usd262net/doln0ockhdm39lkfntxm/NovPTOmeetingminutes162021.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803735/usd262net/zltnhhnyt2jz1fi8k8gy/MarchPTOMeetingMinutes222022.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803693/usd262net/fel78cnko0opxf96lefx/OctPTOMeetingminutes192021.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803721/usd262net/idowphs1sgrl2xrnellg/JanPTOMeetingMinutes1820221.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803679/usd262net/e6jc2mep0odspayjzxmo/SeptPTOMeetingMinutes.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803649/usd262net/eultmjehz33n29yf5nqt/PTOMeetingMinutes2020221.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803664/usd262net/x6uh9b9s0lxpm3h8nmdx/AugustthPTOMinutes.pdf
    IGNORING PDF https://resources.finalsite.net/images/v1673803634/usd262net/izumknwsghgbzuouu4ui/PTOMeetingMinutes2320221.pdf
    2023-11-30 15:00:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/enrollment/student-health-information> (referer: https://west.usd262.net/about/nurses-page)
    2023-11-30 15:00:44 [scrapy.core.engine] INFO: Closing spider (finished)
    2023-11-30 15:00:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 38365,
     'downloader/request_count': 65,
     'downloader/request_method_count/GET': 65,
     'downloader/response_bytes': 248536,
     'downloader/response_count': 65,
     'downloader/response_status_count/200': 24,
     'downloader/response_status_count/301': 6,
     'downloader/response_status_count/302': 34,
     'downloader/response_status_count/404': 1,
     'dupefilter/filtered': 402,
     'elapsed_time_seconds': 8.931808,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2023, 11, 30, 23, 0, 44, 907376),
     'httpcompression/response_bytes': 795436,
     'httpcompression/response_count': 25,
     'log_count/DEBUG': 69,
     'log_count/INFO': 10,
     'offsite/domains': 35,
     'offsite/filtered': 962,
     'request_depth_max': 3,
     'response_received_count': 25,
     'scheduler/dequeued': 65,
     'scheduler/dequeued/memory': 65,
     'scheduler/enqueued': 65,
     'scheduler/enqueued/memory': 65,
     'start_time': datetime.datetime(2023, 11, 30, 23, 0, 35, 975568)}
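
One caveat with the endswith check in PDFRedirect: a Location value that carries a query string or fragment (for example a URL ending in ".pdf?download=1") would slip through, because the extension is no longer at the very end of the string. The helper below is a minimal sketch of a stricter check that compares only the path component. It uses just the standard library; the names is_document_url and BLOCKED_EXTENSIONS are hypothetical and not part of Scrapy.

```python
from urllib.parse import urlparse

# Hypothetical list; extend it with whatever file types you want to skip.
BLOCKED_EXTENSIONS = (".pdf", ".docx")


def is_document_url(location: bytes) -> bool:
    """Return True if a Location header value points at a blocked file type."""
    # Header values arrive as bytes; latin-1 decoding never raises.
    url = location.decode("latin-1")
    # Compare only the path, so a query string or fragment cannot hide the suffix.
    path = urlparse(url).path.lower()
    return path.endswith(BLOCKED_EXTENSIONS)
```

Inside process_response, the condition could then become something like `if is_document_url(response.headers.get("Location", b"")):`, with the rest of the middleware left unchanged.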