Python: parsing HTML from a URL into pandas raises ValueError: No tables found


I'm trying to parse the HTML below into a DataFrame, and I keep getting an error even though I can clearly see a table defined in the HTML. I'd appreciate your help.

<table><tr><td><a

Error

ValueError: No tables found

My code

import pandas as pd 
url='http://rssfeeds.s3.amazonaws.com/goldbox?'
#dfs = pd.read_html(requests.get(url).text)
dfs = pd.read_html(url)
dfs[0].head()

I also tried feedparser, with no luck. I don't get any data.

import feedparser
import pandas as pd
import time

rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
    
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.dealUrl, post.discountPercentage))
df = pd.DataFrame(posts, columns=['title', 'dealUrl', 'discountPercentage'])
df.tail()

Solution

  • The page contains so much data that the request can time out, so I raised the timeout. In addition, the content I got seems to be different from yours.

    import pandas as pd
    from simplified_scrapy import SimplifiedDoc, utils, req
    html = req.get('http://rssfeeds.s3.amazonaws.com/goldbox',
                   timeout=600)
    
    posts = {'title': [], 'link': [], 'description': []}
    doc = SimplifiedDoc(html)
    items = doc.selects('item')
    for item in items:
        posts['title'].append(item.title.text)
        posts['link'].append(item.link.text)
        posts['description'].append(item.description.text)
    
    df = pd.DataFrame(posts)
    df.tail()
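
For comparison, the same item extraction can be sketched with only the standard library. This is a minimal sketch and not the approach used above: `rss_items` and `SAMPLE_RSS` are names made up for illustration, and the sample is a trimmed stand-in shaped like the feed rather than live data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the feed (same shape, not live data).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <item>
      <title>Deal of the Day: Example Watch</title>
      <link>https://www.amazon.com/dp/EXAMPLE?a=1&amp;tag=rssfeeds-20</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Return one dict per <item> holding its title, link and description."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            # findtext returns the default when the child element is missing.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return rows


rows = rss_items(SAMPLE_RSS)
print(rows[0]["title"])
```

The XML parser decodes the entities, so each description comes back as real HTML markup. The resulting list of dicts can be passed to `pd.DataFrame(rows)`. For the live feed, the body would have to be downloaded first, for example with `requests.get(url, timeout=600).content`, which assumes the `requests` package is installed.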
    

    Get the data from each description:

    posts = {'listPrice': [], 'dealPrice': [], 'expires': []}
    doc = SimplifiedDoc(html)
    descriptions = doc.selects('item').description # Get all descriptions
    for table in descriptions:
        d = SimplifiedDoc(table.unescape()) # Using description to build a doc object
        img = d.img.src # Get the image src
        listPrice = d.getElementByText('List Price:')
        if listPrice:
            listPrice = listPrice.strike.text
        else:
            listPrice = ''

        dealPrice = d.getElementByText('Deal Price: ')
        if dealPrice:
            dealPrice = dealPrice.text[len('Deal Price: '):]
        else:
            dealPrice = ''

        expires = d.getElementByText('Expires ')
        if expires:
            expires = expires.text[len('Expires '):]
        else:
            expires = ''
    
        posts['listPrice'].append(listPrice)
        posts['dealPrice'].append(dealPrice)
        posts['expires'].append(expires)
    df = pd.DataFrame(posts)
    df.tail()
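
The most likely reason `pd.read_html` reports no tables is visible in the raw feed: the table markup inside each `<description>` is entity-escaped (`&lt;table&gt;`), so an HTML parser sees plain text rather than a `<table>` element. Below is a minimal standard-library sketch of unescaping one description and picking fields out by their text prefix. `CellText`, `deal_fields`, `PREFIXES`, and `SAMPLE_DESCRIPTION` are hypothetical names, and the sample is shortened from the page data quoted in this answer.

```python
from html import unescape
from html.parser import HTMLParser

# Shortened from the description in the quoted feed output (image link dropped).
SAMPLE_DESCRIPTION = (
    "&lt;table&gt;&lt;tr&gt;&lt;td&gt;Withings Steel - Activity and Sleep "
    "Tracking Watch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018"
    "&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;"
)

# Text prefixes taken from the extraction code above.
PREFIXES = {"dealPrice": "Deal Price: ", "expires": "Expires "}


class CellText(HTMLParser):
    """Collect the non-empty text chunks found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def deal_fields(description):
    """Unescape one description and return the fields found by prefix."""
    parser = CellText()
    parser.feed(unescape(description))
    parser.close()
    fields = {name: "" for name in PREFIXES}
    for chunk in parser.chunks:
        for name, prefix in PREFIXES.items():
            if chunk.startswith(prefix):
                fields[name] = chunk[len(prefix):]
    return fields


print("<table" in SAMPLE_DESCRIPTION)            # escaped: no real tag yet
print("<table" in unescape(SAMPLE_DESCRIPTION))  # unescaped: tag is present
print(deal_fields(SAMPLE_DESCRIPTION))
```

Fields that are absent from a description stay as empty strings, which matches the fallback behaviour of the loop above.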
    

    The page data I get is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
      <channel>
        <title>Amazon.com Gold Box Deals</title>
        <link>http://www.amazon.com/gp/goldbox</link>
        <description>Amazon.com Gold Box Deals</description>
        <pubDate>Thu, 28 Jun 2018 08:50:16 GMT</pubDate>
        <dc:date>2018-06-28T08:50:16Z</dc:date>
        <image>
          <title>Amazon.com Gold Box Deals</title>
          <url>http://images.amazon.com/images/G/01/rcm/logo2.gif</url>
          <link>http://www.amazon.com/gp/goldbox</link>
        </image>
        <item>
          <title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
          <link>https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20</link>
          <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;a href="https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20" target="_blank"&gt;&lt;img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg" alt="Product Image" style='border:0'/&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tr&gt;&lt;td&gt;Withings Activit? Steel - Activity and Sleep Tracking Watch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
          <pubDate>Thu, 28 Jun 2018 07:00:10 GMT</pubDate>
          <guid isPermaLink="false">http://promotions.amazon.com/gp/goldbox/</guid>
          <dc:date>2018-06-28T07:00:10Z</dc:date>
        </item>