I am using the following snippet to get links from the Google search results for the keyword I give.
import mechanize
from bs4 import BeautifulSoup
import re

def googlesearch():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')]
    br.open('http://www.google.com/')
    # do the query
    br.select_form(name='f')
    br.form['q'] = 'scrapy'  # query
    data = br.submit()
    soup = BeautifulSoup(data.read())
    for a in soup.find_all('a', href=True):
        print "Found the URL:", a['href']

googlesearch()
Since I am parsing the search results HTML page to get links, this returns every 'a' tag, but I only need the links for the actual results. Also, the href attribute in the output looks something like this:
Found the URL: /search?q=scrapy&hl=en-IN&gbv=1&prmd=ivns&source=lnt&tbs=li:1&sa=X&ei=DT8HU9SlG8bskgWvqIHQAQ&ved=0CBgQpwUoAQ
But the actual link present in the href attribute is http://scrapy.org/
Can anyone point me to a solution for these two problems?
Thanks in advance
The links you're interested in are inside the h3 tags (with the r class):
<li class="g">
  <h3 class="r">
    <a href="/url?q=http://scrapy.org/&sa=U&ei=XdIUU8DOHo-ElAXuvIHQDQ&ved=0CBwQFjAA&usg=AFQjCNHVtUrLoWJ8XWAROG-a4G8npQWXfQ">
      <b>Scrapy</b> | An open source web scraping framework for Python
    </a>
  </h3>
..
You can find the links using a CSS selector:
soup.select('.r a')
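To make concrete what the `.r a` selector matches, here is a rough sketch that collects the href of every a element nested inside an element with class r. It swaps BeautifulSoup for Python 3's standard-library html.parser so it runs without third-party packages. The class name ResultLinkParser and the sample markup are illustrative assumptions, and Google's real markup may differ or change.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside any element with class "r"."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, has_r_class) for open elements
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inside_r = any(has_r for _, has_r in self.open_tags)
        if tag == "a" and inside_r and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            classes = (attrs.get("class") or "").split()
            self.open_tags.append((tag, "r" in classes))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break


# Hypothetical sample: one navigation link and one result link.
sample = """
<a href="/search?q=scrapy&amp;tbs=li:1">Verbatim</a>
<li class="g">
  <h3 class="r">
    <a href="/url?q=http://scrapy.org/&amp;sa=U"><b>Scrapy</b> | An open source web scraping framework</a>
  </h3>
</li>
"""

parser = ResultLinkParser()
parser.feed(sample)
print(parser.links)
```

Only the link inside the h3 with class r is collected; the navigation link outside it is skipped, which is the filtering the selector provides.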
URLs are in the following format:
/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ
     ^^^^^^^^^^^^^^^^^^^^
The actual URL is in the q parameter.
To get the entire query string, use urlparse.urlparse:
>>> import urlparse
>>> url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
>>> urlparse.urlparse(url).query
'q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
Then, use urlparse.parse_qs to parse the query string and extract the q parameter value:
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
['http://scrapy.org/']
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q'][0]
'http://scrapy.org/'
Putting it together:
for a in soup.select('.r a'):
    print urlparse.parse_qs(urlparse.urlparse(a['href']).query)['q'][0]
Output:
http://scrapy.org/
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://doc.scrapy.org/
http://scrapy.org/download/
http://doc.scrapy.org/en/latest/intro/overview.html
http://scrapy.org/doc/
http://scrapy.org/companies/
https://github.com/scrapy/scrapy
http://en.wikipedia.org/wiki/Scrapy
http://www.youtube.com/watch?v=1EFnX1UkXVU
https://pypi.python.org/pypi/Scrapy
http://pypix.com/python/build-website-crawler-based-upon-scrapy/
http://scrapinghub.com/scrapy-cloud
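The snippets above use Python 2's urlparse module. On Python 3 the same functions are available from urllib.parse. Below is a minimal sketch of the extraction step under that assumption. The helper name extract_target is hypothetical, and the rule of only unwrapping links whose path is /url is an assumption based on the sample links shown here, not a documented Google format.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the URL wrapped in a '/url?q=...' redirect link, or None."""
    parts = urlparse(href)
    if parts.path != "/url":
        return None  # e.g. '/search?...' navigation links
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


# Sample hrefs modeled on the ones quoted in this thread.
hrefs = [
    "/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA",
    "/search?q=scrapy&hl=en-IN&gbv=1",
]
for href in hrefs:
    print(extract_target(href))
```

Returning None instead of indexing ['q'][0] directly avoids a KeyError when a matched link has no q parameter.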