Tags: python, python-2.7, web-scraping, beautifulsoup, mechanize

Not getting proper links from Google search results using mechanize and BeautifulSoup


I am using the following snippet to get links from the Google search results for the keyword I give.

import mechanize
from bs4 import BeautifulSoup


def googlesearch():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')]
    br.open('http://www.google.com/')

    # do the query
    br.select_form(name='f')
    br.form['q'] = 'scrapy'  # query
    data = br.submit()
    soup = BeautifulSoup(data.read(), 'html.parser')
    for a in soup.find_all('a', href=True):
        print "Found the URL:", a['href']

googlesearch()

Since I am parsing the search results HTML page to get the links, it picks up all the 'a' tags, but what I need are only the links for the actual results. Another thing: when you look at the output of the href attribute, it gives something like this

Found the URL: /search?q=scrapy&hl=en-IN&gbv=1&prmd=ivns&source=lnt&tbs=li:1&sa=X&ei=DT8HU9SlG8bskgWvqIHQAQ&ved=0CBgQpwUoAQ

But the actual link present in the href attribute is http://scrapy.org/

Can anyone point me to a solution for the two questions mentioned above?

Thanks in advance


Solution

  • Get only the links for the results

    The links you're interested in are inside h3 tags with the class r:

    <li class="g">
      <h3 class="r">
        <a href="/url?q=http://scrapy.org/&amp;sa=U&amp;ei=XdIUU8DOHo-ElAXuvIHQDQ&amp;ved=0CBwQFjAA&amp;usg=AFQjCNHVtUrLoWJ8XWAROG-a4G8npQWXfQ">
          <b>Scrapy</b> | An open source web scraping framework for Python
        </a>
      </h3>
      ..
    

    You can find the links using a CSS selector:

    soup.select('.r a')
    
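    As a quick sanity check, the selector can be run against the markup shown above (a minimal sketch using bs4's built-in html.parser; the print() form works in both Python 2 and 3):

    ```python
    from bs4 import BeautifulSoup

    # Sample result markup, as shown above (HTML entities like &amp;
    # are decoded by the parser)
    html = '''
    <li class="g">
      <h3 class="r">
        <a href="/url?q=http://scrapy.org/&amp;sa=U">
          <b>Scrapy</b> | An open source web scraping framework for Python
        </a>
      </h3>
    </li>
    '''

    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.select('.r a'):
        print(a['href'])  # -> /url?q=http://scrapy.org/&sa=U
    ```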

  • Get the actual link

    URLs are in the following format:

    /url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ
         ^^^^^^^^^^^^^^^^^^^^
    

    The actual URL is in the q parameter.

    To get the entire query string, use urlparse.urlparse:

    >>> url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
    >>> urlparse.urlparse(url).query
    'q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
    

    Then, use urlparse.parse_qs to parse the query string and extract the q parameter value:

    >>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
    ['http://scrapy.org/']
    >>> urlparse.parse_qs(urlparse.urlparse(url).query)['q'][0]
    'http://scrapy.org/'
    
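    Note that the urlparse module is Python 2 only; if you ever move to Python 3, the same extraction looks like this (a sketch using urllib.parse, which is where urlparse's functions live in Python 3):

    ```python
    from urllib.parse import urlparse, parse_qs

    # Same redirect-style URL as in the Python 2 session above
    url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'

    # parse_qs returns a dict mapping each parameter to a list of values
    print(parse_qs(urlparse(url).query)['q'][0])  # http://scrapy.org/
    ```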

    Final result

    import urlparse

    for a in soup.select('.r a'):
        print urlparse.parse_qs(urlparse.urlparse(a['href']).query)['q'][0]
    

    Output:

    http://scrapy.org/
    http://doc.scrapy.org/en/latest/intro/tutorial.html
    http://doc.scrapy.org/
    http://scrapy.org/download/
    http://doc.scrapy.org/en/latest/intro/overview.html
    http://scrapy.org/doc/
    http://scrapy.org/companies/
    https://github.com/scrapy/scrapy
    http://en.wikipedia.org/wiki/Scrapy
    http://www.youtube.com/watch?v=1EFnX1UkXVU
    https://pypi.python.org/pypi/Scrapy
    http://pypix.com/python/build-website-crawler-based-upon-scrapy/
    http://scrapinghub.com/scrapy-cloud
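
    Putting both steps together, a slightly more defensive version can skip any href that carries no q parameter, in case a non-result link ever slips through the selector. This is a sketch in Python 3 syntax, and extract_result_url is a hypothetical helper, not part of the original answer:

    ```python
    from urllib.parse import urlparse, parse_qs

    def extract_result_url(href):
        """Return the target URL from a Google redirect-style href,
        or None if the href has no q parameter (e.g. an internal
        navigation link such as /search?...)."""
        params = parse_qs(urlparse(href).query)
        values = params.get('q')
        return values[0] if values else None

    print(extract_result_url('/url?q=http://scrapy.org/&sa=U'))  # http://scrapy.org/
    print(extract_result_url('/search?hl=en&gbv=1'))             # None
    ```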