I'm new to Python and would like your advice for the issue I've encountered recently. I'm doing a small project where I tried to scrape a comic website to download a chapter (pictures). However, when printing out the page content for testing (because i tried to use Beautifulsoup.select() and got no result), it only showed a line of html:
'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'
Any help would be really appreciated.
from requests_html import HTMLSession
session = HTMLSession()
res = session.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
res.html.render()
print(res.content)
I also tried this but the resutl was the same.
import requests, bs4
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get(url, headers={"User-Agent": "Requests"})
res.raise_for_status()
# soup = bs4.BeautifulSoup(res.text, "html.parser")
# onePiece = soup.select(".page-chapter")
print(res.content)
update: I installed docker and splash (on Windows 11) and it worked. I included the update code. Thanks Franz and others for yor help.
import os
import requests, bs4
os.makedirs("OnePiece", exist_ok=True)
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 5})
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
onePiece = soup.find_all("img", class_="lazy")
for element in onePiece:
imageLink = "https:" + element["data-cdn"]
res = requests.get(imageLink)
imageFile = open(os.path.join("OnePiece", os.path.basename(imageLink)), "wb")
for chunk in res.iter_content(100000):
imageFile.write(chunk)
imageFile.close()
This response means that we need a javascript render that reload the page using this cookie. for you get the content some workaround must be added.
I commonly use splash scrapinhub render engine and putting a sleep in the page just renders ok all the content. Some tools that render in same way are selenium for python or pupitter in JS.