I am attempting to scrape product information from lowes.com. My test case is this product: AirStone 8-sq ft Autumn Mountain Faux Stone Veneer. When I visit the page with JavaScript disabled (to make sure I'm not seeing anything that urllib / requests wouldn't pick up) I clearly see a price for the item, yet when I use either package I am missing several sections of the page.
It just so happens those are the sections I need (price information specifically; everything else, magically, is still available). I'd prefer not to use Selenium for speed's sake. My current usage for both requests and urllib looks like this:
Common Items
from urllib.request import Request, urlopen
import requests  # switch as needed with urlopen
import gzip  # manual decompression required with urlopen, or so I've found
url = "https://www.lowes.com/pd/AirStone-8-sq-ft-Autumn-Mountain-Faux-Stone-Veneer/50247201"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.8",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "DNT": "1",
    # "Host": "www.lowes.com",  # <=- Tried, no difference
    "Pragma": "no-cache",
    # "Referer": "https://www.lowes.com/",  # <=- Tried, no difference
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                  " (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"  # <=- Tried placing it all on one line, didn't make a difference
}
Urlopen
req = Request(url, None, headers)
page = gzip.decompress(urlopen(req).read()).decode('utf-8')
with open("content.txt", "w") as f:
    f.write(page)  # <=- missing the 59.97 price tag anywhere in the document :(
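(Note that urlopen doesn't decompress anything for you, and if the server ever responds uncompressed, gzip.decompress will raise. A slightly more defensive sketch of the same fetch, checking Content-Encoding first, standard library only:)

req = Request(url, None, headers)
resp = urlopen(req)
body = resp.read()
if resp.headers.get("Content-Encoding") == "gzip":  # decompress only when actually gzipped
    body = gzip.decompress(body)
    # caveat: the headers above also advertise "br" (brotli), which gzip can't
    # handle; drop "br" from Accept-Encoding unless you have a brotli decoder
page = body.decode("utf-8")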
Requests
sessions = requests.Session()
page = sessions.get(url, headers=headers)
with open("content.txt", "w") as f:
    f.write(page.text)  # <=- Also missing the 59.97 price tag anywhere in the document :'(
So my question is: am I missing something? Is there a reason for this data to be missing? It isn't JavaScript related, as I intentionally disable JavaScript before trying to scrape, since I saw that was the issue a lot of the time.
Any help would be greatly appreciated.
Per the comment from jasonharper: cookies ended up being the answer. Finding the right one allowed me to extract all the data I needed.
In short, always disable / delete cookies before trying to scrape a website, if for no other reason than to make sure you see what the script sees.
For those curious, the specific cookie is {"sn": "####"} (store number). You can simply pick a store with JavaScript enabled, hover over it, and look at the URL it links to to find the store number. Change to suit.
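A minimal sketch of the working request, assuming the sn cookie is all that's needed on top of the headers above ("####" stays a placeholder for your own store number):

import requests

url = "https://www.lowes.com/pd/AirStone-8-sq-ft-Autumn-Mountain-Faux-Stone-Veneer/50247201"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
           " (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
cookies = {"sn": "####"}  # <=- replace #### with your store number

sessions = requests.Session()
page = sessions.get(url, headers=headers, cookies=cookies)
with open("content.txt", "w") as f:
    f.write(page.text)  # <=- price tag now present in the document :)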