python regex beautifulsoup parse-url

Parse multiple URLs and extract data


  1. I need to parse an HTML page and get all the URLs that meet my requirement.
  2. Then I need to parse each of the extracted URLs to get the data that I want, if the page title matches something, and save the results to multiple files based on their names.

I have done part 1 in the following way.

    pattern=re.compile(r'''class="topline"><A href="(.*?)"''')
    da = pattern.search(web_page)
    da = pattern.findall(soup1)
    col_width = max(len(word) for row in da for word in row)
    for row in da:
        if "some string" in row.upper():
            bb = "".join(row.ljust(col_width))
            print >> links, bb
    

I'd truly appreciate any help. Thank you.


Solution

  • First of all, do not parse HTML with regex. You've tagged the question with BeautifulSoup, yet the code still uses regular expressions.

    Here's how you can get the links, follow them and check the title:

    from urllib2 import urlopen
    from bs4 import BeautifulSoup
    
    URL = "url here"
    
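    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(urlopen(URL), "html.parser")
    links = soup.select('.topline > a')
    for a in links:
        # read the href from the current <a> tag
        link = a.get('href')
        if link:
            # follow link
            link_soup = BeautifulSoup(urlopen(link), "html.parser")
            title = link_soup.find('title')
            # check title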
    

    The .topline > a CSS selector matches every a tag that is a direct child of an element with the topline class.
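
To see what the child combinator does and does not match, here is a minimal sketch run against made-up markup. The SAMPLE_HTML string and its hrefs are assumptions for illustration, not taken from the real page, and the snippet is written for Python 3 with bs4 installed.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; only the first link sits directly under .topline
SAMPLE_HTML = """
<table>
  <tr><td class="topline"><a href="/reports/1.html">Report 1</a></td></tr>
  <tr><td class="topline"><span><a href="/nested.html">Nested</a></span></td></tr>
  <tr><td class="other"><a href="/other.html">Other</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# ">" is the child combinator: the <a> must be a direct child of .topline
direct = [a.get("href") for a in soup.select(".topline > a")]

# A space (descendant combinator) also matches links nested deeper
nested = [a.get("href") for a in soup.select(".topline a")]

print(direct)
print(nested)
```

If the links on the real page are wrapped in another tag inside the topline element, the descendant form .topline a is probably the one to use.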

    Hope that helps.
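
The question also asks about saving the matching pages to separate files, which the code above leaves at the "check title" comment. One possible way to do that is sketched below, using only the standard library. The helper names (title_matches, filename_for, save_matching) and the naming scheme are hypothetical, and the sketch assumes each page has already been reduced to a (title, text) pair.

```python
import os
import re


def title_matches(title, keyword):
    """Case-insensitive substring check on the page title."""
    return keyword.lower() in title.lower()


def filename_for(title):
    """Turn a page title into a safe file name (hypothetical naming scheme)."""
    # Collapse every run of non-alphanumeric characters into one underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_").lower()
    return (slug or "untitled") + ".txt"


def save_matching(pages, keyword, out_dir):
    """Write the text of each page whose title matches to its own file.

    pages is an iterable of (title, text) pairs; returns the paths written.
    """
    written = []
    for title, text in pages:
        if not title_matches(title, keyword):
            continue
        path = os.path.join(out_dir, filename_for(title))
        with open(path, "w") as handle:
            handle.write(text)
        written.append(path)
    return written
```

Inside the loop, the pair could be built from link_soup.title.string and link_soup.get_text(), after checking that the page actually has a title. Note that two pages with the same title would overwrite each other under this naming scheme.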