python regex urllib

Web scraping: read all href


I wrote a small Python script to read all hrefs from a web page, but it has a problem: it misses some links, for example href="pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648".

Code:

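import re
import urllib.request

urls = ["http://something.com"]

# Matches only href values wrapped in double quotes
regex = r'href="(.+?)"'
pattern = re.compile(regex)

# urlopen returns bytes in Python 3, so decode before matching
htmlfile = urllib.request.urlopen(urls[0])
htmltext = htmlfile.read().decode("utf-8", errors="replace")
hrefs = pattern.findall(htmltext)
print(hrefs)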

Can anybody help me? Thanks.


Solution

  • Use BeautifulSoup together with requests for static websites. BeautifulSoup parses the HTML for you, so you can read the value of each href attribute directly instead of matching it with a regex. Hope it helps.

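    import requests
    from bs4 import BeautifulSoup
    
    url = 'whatever url you want to parse'
    
    # Download the page
    result = requests.get(url)
    
    # Parse the HTML with Python's built-in parser
    soup = BeautifulSoup(result.content, 'html.parser')
    
    # href=True keeps only <a> tags that actually have an href attribute
    for a in soup.find_all('a', href=True):
        print("Found the URL:", a['href'])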
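
  • If you cannot install extra packages, the same idea can be sketched with the standard library's html.parser. This is only an illustration, not the code from the answer above: SAMPLE_HTML, HrefCollector and the two helper functions are made-up names, and the sample markup is an assumption, since the real page source is not shown in the question. It guesses that the missing link is written with single quotes or without quotes, which the regex from the question would skip.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; the real page source is not shown in the question.
SAMPLE_HTML = """
<a href="index.php">Home</a>
<a href='pages.php?ef=fa&page=n_fullstory.php&NewsIDn=1648'>Story</a>
<a HREF=about.php>About</a>
<a name="top">Anchor without href</a>
"""


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the quotes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def hrefs_with_regex(html):
    # The pattern from the question: only double-quoted, lowercase href.
    return re.findall(r'href="(.+?)"', html)


def hrefs_with_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


regex_hrefs = hrefs_with_regex(SAMPLE_HTML)
parser_hrefs = hrefs_with_parser(SAMPLE_HTML)
print("regex :", regex_hrefs)
print("parser:", parser_hrefs)
```

    On this sample the regex finds only the double-quoted link, while the parser also returns the single-quoted and unquoted ones. If your page turns out to use double quotes everywhere, the cause is something else, for example links that are added by JavaScript after the page loads.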