Tags: python, web-scraping, beautifulsoup, python-requests, cloudflare

Python Scraper Unable to scrape img src


I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 with the Requests and BeautifulSoup libraries. The scraped image tags come back with a blank "src".

Source code:

from bs4 import BeautifulSoup
import cfscrape

# create_scraper() returns a requests.Session subclass
scraper = cfscrape.create_scraper()

url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"

response = scraper.get(url)

soup2 = BeautifulSoup(response.text, 'html.parser')

# the reader images are expected inside <div id="divImage">
divImage = soup2.find('div', {"id": "divImage"})

for img in divImage.find_all('img'):
    print(img)

response.close()

I suspect the images can't be scraped because the website uses Cloudflare. On that assumption, I also tried the "cfscrape" library to fetch the content.


Solution

  • You need to wait for JavaScript to inject the HTML for the images.

    Several tools can do this. I was able to get it working with Selenium:

    from bs4 import BeautifulSoup

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Firefox()
    # it takes forever to load the page, therefore we are setting a threshold
    driver.set_page_load_timeout(5)

    try:
        driver.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206")
    except TimeoutException:
        # never ignore exceptions silently in real world code
        pass

    # parse whatever the browser has rendered so far
    soup2 = BeautifulSoup(driver.page_source, 'html.parser')
    divImage = soup2.find('div', {"id": "divImage"})

    # close the browser and end the WebDriver session
    driver.quit()

    for img in divImage.find_all('img'):
        print(img.get('src'))
    

    Refer to "How to download image using requests" if you also want to download these images.
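
For the download step, here is a minimal sketch rather than a definitive implementation. It assumes you have already collected the "src" values printed by the loop above into a list. The helper names (filename_from_url, fetch_bytes, save_images), the 30-second timeout, and the numbered file-name scheme are all hypothetical choices made for illustration. Blank or missing "src" values are skipped.

```python
import os
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="image"):
    """Derive a local file name from the path part of an image URL."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


def fetch_bytes(url, timeout=30):
    """Download one URL with requests and return the raw response body."""
    # Imported here so save_images() can also be used with a custom
    # fetch callable when requests is not installed.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.content


def save_images(urls, directory, fetch=fetch_bytes):
    """Save every non-empty URL in `urls` into `directory`; return the paths."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        if not url:  # skip tags whose src was blank or missing
            continue
        # Prefix with the position in the list to keep the page order
        name = "{:03d}_{}".format(index, filename_from_url(url))
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

With the Selenium code above, this could be called as save_images([img.get('src') for img in divImage.find_all('img')], "pages"), where "pages" is an arbitrary output directory. The fetch parameter is there so that a different downloader (for example one that reuses a cfscrape session) can be swapped in without changing the loop.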