
Python: scrape a website with Requests and lxml


I'm using this guide as a starting point: http://docs.python-guide.org/en/latest/scenarios/scrape/

from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)

Everything works as expected. But this:

from lxml import html
import requests

page = requests.get('http://www.streetinsider.com/ipo_history.php?type=upcoming')
tree = html.fromstring(page.text)

gives this error:

File "<string>", line unknown
XMLSyntaxError: line 1: Document is empty

Using pyquery:

from pyquery import PyQuery as pq
from lxml import etree,html
import requests


response = pq(url='http://www.streetinsider.com/ipo_history.php?type=upcoming')

doc = pq(response.content)

throws this error:

File "<string>", line unknown
XMLSyntaxError: line 1504: Unexpected end tag : h2

Any help getting the table from the webpage would be appreciated.


Solution

  • Some websites detect and block certain user agents (web robots, for example). The web app behind www.streetinsider.com seems to detect the python-requests user agent and (passively) blocks its HTTP requests, so the response body comes back empty and lxml has nothing to parse.

    You can set the user agent with the headers={'User-Agent': ''} parameter of the requests.get call:

    page = requests.get('http://www.streetinsider.com/ipo_history.php',
                        headers={'User-Agent': 'tester'},
                        params={'type': 'upcoming'})
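
To tie the fix back to the original goal of getting the table, here is a minimal sketch of one way to do it. It assumes that overriding the default user agent (Requests identifies itself as python-requests/<version>) is enough for the site to return HTML. The USER_AGENT value and the fetch_html and table_rows helper names are made up for illustration, and the XPath simply collects every table row on the page rather than targeting the specific IPO table.

```python
from lxml import html
import requests

# Hypothetical User-Agent string (an assumption); by default Requests sends
# "python-requests/<version>", which some sites appear to block.
USER_AGENT = 'Mozilla/5.0 (compatible; example-scraper)'


def fetch_html(url, params=None):
    """Fetch a page with a custom User-Agent and fail loudly on an empty body."""
    page = requests.get(url, params=params,
                        headers={'User-Agent': USER_AGENT}, timeout=10)
    page.raise_for_status()
    # An empty body is what makes lxml raise "Document is empty".
    if not page.text.strip():
        raise ValueError('empty response body; the request may have been blocked')
    return page.text


def table_rows(html_text):
    """Return the stripped cell text of every non-empty table row."""
    tree = html.fromstring(html_text)
    rows = []
    for tr in tree.xpath('//table//tr'):
        cells = [cell.text_content().strip() for cell in tr.xpath('./th|./td')]
        if cells:
            rows.append(cells)
    return rows
```

Calling table_rows(fetch_html('http://www.streetinsider.com/ipo_history.php', params={'type': 'upcoming'})) should then yield one list of cell strings per row, provided the site still serves the page to that user agent. Checking for an empty body before parsing turns the confusing XMLSyntaxError into an error that points at the real cause.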