This URL with query string for the zip code loads the search results correctly in the browser:
https://www.psychotherapy.org.uk/find-a-therapist/?Location=M3%201AR&Distance=10&page=7
Each search result has its own h2 tag. In scrapy shell I get a 200 response, but the only HTML Scrapy returns is the header, footer, menu, etc.; the search results HTML is missing.
Below is an example for the h2 tag, but it is the same for any tag.
Any explanations, please?
In [1]: fetch('https://www.psychotherapy.org.uk/find-a-therapist/?Location=M3%201AR&Distance=10&page=7')
2024-04-12 15:45:28 [scrapy.core.engine] INFO: Spider opened
2024-04-12 15:45:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fwww.psychotherapy.org.uk%2Frobots.txt> (referer: None)
2024-04-12 15:45:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fproxy.scrapeops.io%2Frobots.txt> (referer: None)
2024-04-12 15:45:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=e3486f25-6d19-4663-b876-35d01b822096&url=https%3A%2F%2Fwww.psychotherapy.org.uk%2Ffind-a-therapist%2F%3FLocation%3DM3%25201AR%26Distance%3D10%26page%3D7> (referer: None)
In [2]: response.css('h2').getall()
Out[2]:
['<h2>Refine your search</h2>',
'<h2>Looking for a specific therapist?</h2>',
'<h2>\r\n <span class="sub">Bookmarks</span>\r\n My Shortlist\r\n </h2>',
'<h2>Contact us</h2>',
'<h2>Links</h2>',
'<h2>Connect with us</h2>']
In [3]:
As Lakshmanarao Simhadri pointed out, the page sends a second request, a POST, when it loads, and the search results come from that response rather than from the initial HTML. The network tab shows it being sent here: https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search
You need to send the same form data with this POST request (it can be copied from the network tab as well). Scrapy's FormRequest class can be used to assemble the request.
The following sample can be run through scrapy shell:
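# Form fields copied from the POST request seen in the browser's network tab.
form_data = {
    "HelpWith": "",
    "InPerson": "false",
    "Remote": "false",
    # FormRequest URL-encodes values itself, so pass a plain space here;
    # a literal "+" would be sent as %2B.
    "Location": "M3 1AR",
    "Pager.CurrentPage": "7",
    "KeywordFilter": "",
    "Distance": "10",
    "LocationSearchOutsideUK": "false",
    "OnlyProfilesWithPhotos": "false",
    "OnlyWheelchairAccessible": "false",
    # Value taken from the captured request; it may differ between sessions.
    "OrderSeed": "2107061618",
}
# X-Requested-With is an HTTP header rather than a form field, so send it as one.
req = scrapy.FormRequest(
    url="https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search",
    formdata=form_data,
    headers={"X-Requested-With": "XMLHttpRequest"},
)
fetch(req)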
response.css(".profile-listing h2::text").getall()
The CSS query above should return the names of the therapists.
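If you need more than one page or postcode, it may help to build the form data with a small helper. The sketch below is only an illustration: `build_form_data` and `SEARCH_URL` are names I made up, the field names are the ones from the captured request, and I am assuming the server accepts the same `OrderSeed` for every page. It uses the standard library's `urlencode` to show how a form body gets encoded (as far as I know Scrapy's FormRequest encodes values the same way), which is why the postcode should contain a plain space rather than a pre-encoded `+`. `X-Requested-With` is left out here because it is normally an HTTP header, not a form field.

```python
from urllib.parse import urlencode

# Endpoint seen in the browser's network tab.
SEARCH_URL = "https://www.psychotherapy.org.uk/umbraco/Surface/SearchSurface/Search"


def build_form_data(location, page, distance=10, order_seed="2107061618"):
    """Hypothetical helper: form fields for one page of search results.

    The default order_seed is the value from the captured request and is
    assumed (not verified) to stay valid across pages.
    """
    return {
        "HelpWith": "",
        "InPerson": "false",
        "Remote": "false",
        "Location": location,
        "Pager.CurrentPage": str(page),
        "KeywordFilter": "",
        "Distance": str(distance),
        "LocationSearchOutsideUK": "false",
        "OnlyProfilesWithPhotos": "false",
        "OnlyWheelchairAccessible": "false",
        "OrderSeed": order_seed,
    }


# A plain space is encoded as "+" when the body is built.
body_raw = urlencode(build_form_data("M3 1AR", page=7))

# A "+" that is already in the value gets encoded again, as %2B.
body_pre = urlencode(build_form_data("M3+1AR", page=7))

print(body_raw)
print(body_pre)
```

In scrapy shell or a spider, the dict returned by `build_form_data(...)` can be passed as `formdata=` to `scrapy.FormRequest(url=SEARCH_URL, ...)`, changing only the `page` argument to walk through the result pages.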