python google-crawlers

Crawling links from Google


I am trying to crawl Google result links for a given field (e.g. computer science), but I am getting some very strange links in the output. When I try to open those links in a web browser, it shows "page not found".

Here is the code:

from bs4 import BeautifulSoup
import requests

a = input("search:")
page = requests.get("https://www.google.dz/search?q="+a)
soup = BeautifulSoup(page.content)
links = soup.findAll("a")
for link in links:
    if link['href'].startswith('/url?q='):
        print (link['href'].replace('/url?q=',''),'\n')
        # f = open('links.txt','a+')
        # f.write(link['href'].replace('/url?q=',''))
        # f.close()

And output:

search:"data"
('http://www.zdnet.fr/actualites/data-lakes-ne-les-confondez-pas-avec-un-data-warehouse-39832052.htm&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQqQIIFTAA&usg=AFQjCNFZzS0E1EDF51VtLq-KWuxvg2HPeg', '\n')
('http://www.journaldugeek.com/2016/02/01/microsoft-planche-sur-des-data-centers-sous-marins/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQqQIIFzAB&usg=AFQjCNGjc0-ev9X5MigD0-mzSx0zr5-6Qw', '\n')
('http://www.01net.com/actualites/microsoft-veut-noyer-vos-donnees-et-ses-data-centers-en-pleine-mer-947974.html&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQqQIIGTAC&usg=AFQjCNEB9fsmDeARKnjwjyfe90bpJwJWcA', '\n')
('http://rmsnews.com/big-data-recrutement-par-jean-christophe-anna/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IHTAD&usg=AFQjCNEc125DUcwyX9QTCNus0hBRsFS6DA', '\n')
('http://bolin.su.se/data/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IHzAE&usg=AFQjCNEwuKR9IlFHwCgNQagBZt8NN8M9Iw', '\n')
('http://birt.actuate.com/products/ihub/data-access&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IITAF&usg=AFQjCNFAGC79QVuHPrw7M9pzzC7Jh_EYSw', '\n')
('http://www.lepoint.fr/technologie/video-le-big-data-jusqu-ou-18-03-2015-1913631_58.php&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQwW4IIzAG&usg=AFQjCNF_j4WlW_axSMjtpiONdh6OjlEaMQ', '\n')
('https://fr.wikipedia.org/wiki/Donn%25C3%25A9e&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgglMAc&usg=AFQjCNELfR-1pSA9e4KyzDCBx8SVtkMvyg', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:zXVlfFTefbsJ:https://fr.wikipedia.org/wiki/Donn%2525C3%2525A9e%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAgoMAc&usg=AFQjCNGuPAXHAqtRMSB8l7D9DoOFn3Ta4g', '\n')
('https://en.wikipedia.org/wiki/Data&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFggqMAg&usg=AFQjCNHIINpuNGYzYlOWVUb628dcSnownw', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:n6Ofwm3_TzIJ:https://en.wikipedia.org/wiki/Data%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAgtMAg&usg=AFQjCNF8fbDR6kGbFRPBzkz20ZpjXE23JA', '\n')
('https://en.wikipedia.org/wiki/Data_(disambiguation)&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIILygAMAg&usg=AFQjCNGK7coMxJqmsREt19hEmLWR6QW4Ow', '\n')
('https://en.wikipedia.org/wiki/Data_(computing)&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIIMCgBMAg&usg=AFQjCNEudeiCi_0HFgdzj0KnJRhxIRPRPA', '\n')
('https://en.wikipedia.org/wiki/Metadata&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIIMSgCMAg&usg=AFQjCNFRY05jK0c4QakO-YFoTvPfn013IQ', '\n')
('https://en.wikipedia.org/wiki/Data_analysis&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQ0gIIMigDMAg&usg=AFQjCNEwtBoC4KyGymoijiJUcYfkgr1p6w', '\n')
('https://fr.wikipedia.org/wiki/Data_(homonymie)&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgg0MAk&usg=AFQjCNHxzrXByg4-rj2zllD2MCnkTDWe0g', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:chZvlvbLIsIJ:https://fr.wikipedia.org/wiki/Data_(homonymie)%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAg3MAk&usg=AFQjCNEI0IGMlEht_Lc1l6aftJ2ZThbgEg', '\n')
('https://www.youtube.com/user/datagueule&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgg5MAo&usg=AFQjCNHgpxg20cdG4wnoULcRirJJtNurJA', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:OMKbWLSVB4QJ:https://www.youtube.com/user/datagueule%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAg8MAo&usg=AFQjCNEybm3Zwr346unQx-7oTk92Vq_V9g', '\n')
('http://www.data.com/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFgg_MAs&usg=AFQjCNE_K3RocyeXQFhYWa4tlNL19sKAXQ', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:HikntWD5aqMJ:http://www.data.com/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhCMAs&usg=AFQjCNGcB8SlqjU0tsxSEmJ9Bcgp70hAcw', '\n')
('http://data.worldbank.org/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFghFMAw&usg=AFQjCNH2NwwJkUkGvN6oCOGVSJ4OIolarw', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:BuQHDbbGLT0J:http://data.worldbank.org/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhIMAw&usg=AFQjCNFHNBkzsNR71hTX9t3rNwbGbrMxdw', '\n')
('http://data.bnf.fr/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFghLMA0&usg=AFQjCNEvZ5gWO0hOX_PQFj3eYUv3OdMXMA', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:z8MGwIoF1bkJ:http://data.bnf.fr/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhOMA0&usg=AFQjCNGTJvbzKA1PEa3jH9fa-bizChljhA', '\n')
('https://www.facebook.com/0data0/&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQFghRMA4&usg=AFQjCNEwUYPG6WJvzbaU2lwk8-2z9398_Q', '\n')
('http://webcache.googleusercontent.com/search%3Fq%3Dcache:rT9_WJoHdrYJ:https://www.facebook.com/0data0/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhUMA4&usg=AFQjCNETnGcEqlHE7wmGT7AEgTweVMHBqw', '\n')

For example, I pasted this link into the browser:

http://webcache.googleusercontent.com/search%3Fq%3Dcache:rT9_WJoHdrYJ:https://www.facebook.com/0data0/%252Bdata%26hl%3Dfr%26%26ct%3Dclnk&sa=U&ved=0ahUKEwjG9ZnI0dnKAhXBao4KHWe3DoYQIAhUMA4&usg=AFQjCNETnGcEqlHE7wmGT7AEgTweVMHBqw

And the browser showed an error page (image omitted).

I am asking because when an ordinary user searches for something on Google, the results link straight to the page they need, whereas the links I extract do not get me there. (I also intend to save the links to a file, but the output there is just as messy and hard to read.) I don't know how to parse the links correctly. How should I do it?


Solution

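  • Use the following condition. It skips Google's cached-page links (webcache.googleusercontent.com) and keeps only the destination URL, dropping the tracking parameters (&sa=, &ved=, &usg=) that follow the first '&':

    # your code
    if link['href'].startswith('/url?q=') \
            and 'webcache.googleusercontent.com' not in link['href']:
        print(link['href'].split('/url?q=')[1].split('&')[0])
    # your code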
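
As an alternative to splitting the string by hand, the same filtering can be sketched with the standard library's urllib.parse, which also undoes the extra percent-encoding visible in the output above (e.g. Donn%25C3%25A9e). This is only a minimal sketch for Python 3: the helper name extract_target is hypothetical, and it assumes the hrefs still have the '/url?q=...&sa=...' shape shown in the question.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination URL from a '/url?q=...' href, or None.

    Hypothetical helper: assumes the redirect format shown in the question.
    """
    parsed = urlparse(href)
    # Only Google's redirect links carry the real destination in 'q'.
    if parsed.path != "/url":
        return None
    # parse_qs splits the query on '&' and percent-decodes each value once.
    targets = parse_qs(parsed.query).get("q")
    if not targets:
        return None
    target = targets[0]
    # Skip links to Google's cached copies, as in the condition above.
    if "webcache.googleusercontent.com" in target:
        return None
    return target
```

Inside the loop from the question, this could be used as target = extract_target(link.get('href', '')), printing or saving target only when it is not None. Using link.get avoids a KeyError for anchor tags that have no href attribute.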