I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 with the Requests and BeautifulSoup libraries. The scraped image tags have a blank "src" attribute.
Code:
from bs4 import BeautifulSoup
import cfscrape  # third-party; needed for create_scraper() below
import requests

scraper = cfscrape.create_scraper()
url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"
# also tried scraper.get(url) in place of requests.get(url)
response = requests.get(url)
soup2 = BeautifulSoup(response.text, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
for img in divImage.findAll('img'):
    print(img)
response.close()
I suspect the image scraping is blocked because the website uses Cloudflare. On that assumption, I also tried scraping the content with the "cfscrape" library.
You need to wait for JavaScript to inject the HTML code for the images.
Several tools are capable of doing this. I was able to get it working with Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
# it takes forever to load the page, therefore we are setting a threshold
driver.set_page_load_timeout(5)
try:
    driver.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206")
except TimeoutException:
    # never ignore exceptions silently in real world code
    pass

soup2 = BeautifulSoup(driver.page_source, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
# close the browser
driver.close()

for img in divImage.findAll('img'):
    # print() is a function in Python 3
    print(img.get('src'))
Refer to "How to download image using requests" if you also want to download these images.
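For completeness, here is a rough sketch of what such a download could look like with Requests. The helper names `filename_from_url` and `download_image`, the fallback file name, and the 30-second timeout are all hypothetical, and it assumes the image host serves the file to a plain request without the browser's cookies, which may not hold for a Cloudflare-protected site.

```python
import os
from urllib.parse import urlsplit


def filename_from_url(url, default="image.jpg"):
    """Derive a local file name from the last path segment of `url`."""
    name = os.path.basename(urlsplit(url).path)
    return name or default


def download_image(url, folder="."):
    """Stream one image into `folder` and return the path it was saved to."""
    # Third-party; imported here so filename_from_url works without it.
    import requests

    path = os.path.join(folder, filename_from_url(url))
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    with open(path, "wb") as fh:
        # Write the body in chunks instead of loading it all into memory.
        for chunk in response.iter_content(chunk_size=8192):
            fh.write(chunk)
    return path
```

Each `src` value printed by the Selenium code could then be passed to `download_image`.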