Tags: python, extract, python-newspaper, newspaper3k

Extract image using Newspaper from HTML


I can't download articles the usual way, where the Article object is instantiated from a URL and fetched, like below:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()  # top_image is only populated after parsing
article.top_image

However, I can get the HTML from a request. Can I pass this raw HTML to Newspaper somehow and extract the image from it? (Below is my attempt, which doesn't work.) Thanks.

from newspaper import Article
import requests
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image

Solution

  • The Python module Newspaper allows proxies to be used, but this feature is not listed within the module's documentation.


    Proxies with Newspaper

    from newspaper import Article
    from newspaper.configuration import Configuration
    
    # add your corporate proxy information and test the connection
    PROXIES = {
               'http': "http://ip_address:port_number",
               'https': "https://ip_address:port_number"
              }
    
    config = Configuration()
    config.proxies = PROXIES
    
    url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
    article = Article(url, config=config)
    article.download()
    article.parse()
    print(article.top_image)
    # Output:
    # https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
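    The proxy mapping can also be built from a host and port instead of being typed out by hand. The sketch below is only an illustration: `build_proxies` is a hypothetical helper, not part of Newspaper or requests, and it assumes the proxy itself is reached over plain HTTP for both keys. Change the `scheme` argument if your proxy is set up differently.

```python
def build_proxies(host, port, scheme="http"):
    """Return a requests-style proxies mapping (hypothetical helper).

    Assumption: the same proxy endpoint handles both HTTP and HTTPS traffic.
    """
    proxy_url = f"{scheme}://{host}:{port}"
    # The keys name the scheme of the target URL; the value is the proxy address.
    return {"http": proxy_url, "https": proxy_url}


# Example with placeholder values:
# config.proxies = build_proxies("ip_address", "port_number")
```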
    

    Requests with Proxies and Newspaper

    import requests
    from newspaper import Article
    
    url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
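    # proxy is assumed to be a requests-style proxies dict, e.g. PROXIES from the example above
    # verify=False disables TLS certificate verification
    raw_html = requests.get(url, verify=False, proxies=proxy)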
    article = Article('')
    article.download(raw_html.content)
    article.parse()
    print(article.top_image)
    # Output:
    # https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
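
    The attempt in the question most likely fails for two reasons: `requests.get` returns a Response object rather than the markup itself, and `top_image` is only filled in once `parse()` has run. A minimal sketch wrapping the working steps in one function follows. `top_image_from_html` and its `article_cls` parameter are hypothetical names introduced here for illustration; the positional `download(html)` call mirrors the answer's own usage.

```python
def top_image_from_html(html, article_cls=None):
    """Return the top image URL Newspaper finds in already-fetched HTML.

    article_cls is a hypothetical hook for substituting a stand-in class;
    by default the real newspaper.Article class is imported and used.
    """
    if article_cls is None:
        from newspaper import Article  # imported lazily, only when needed
        article_cls = Article
    article = article_cls('')   # no URL needed; the HTML is supplied directly
    article.download(html)      # pass the markup (str or bytes), not the Response object
    article.parse()             # top_image is populated during parsing
    return article.top_image


# Usage, assuming `response` came from requests.get(url, proxies=proxy):
# print(top_image_from_html(response.content))
```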