Now I need to parse each of the extracted URLs, pull out the data I want if the page title matches a given string, and save the results to multiple files based on their names. I have done the first part in the following way:
pattern = re.compile(r'''class="topline"><A href="(.*?)"''')
da = pattern.findall(web_page)
col_width = max(len(word) for row in da for word in row)
for row in da:
    if "some string" in row.upper():
        bb = "".join(row.ljust(col_width))
        print >> links, bb
I'd truly appreciate any help. Thank you.
First of all, do not parse HTML with regular expressions. You've actually tagged the question with BeautifulSoup, but you are still using regexes here.

Here's how you can get the links, follow them and check the title:
from urllib2 import urlopen
from bs4 import BeautifulSoup

URL = "url here"
soup = BeautifulSoup(urlopen(URL))

# find every "a" tag directly under an element with the "topline" class
links = soup.select('.topline > a')
for a in links:
    link = a.get('href')
    if link:
        # follow the link
        link_soup = BeautifulSoup(urlopen(link))
        title = link_soup.find('title')
        # check the title here
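The question also asks about saving results to multiple files based on their names, which the loop above stops short of. One way to finish that step is sketched below; the save_pages helper and the filename-slug rule are my assumptions, not part of the original code:

```python
import os
import re

def save_pages(pages, out_dir="out"):
    # pages: iterable of (link, html_text) pairs
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    paths = []
    for link, html in pages:
        # derive a safe filename from the last URL segment,
        # replacing anything that isn't alphanumeric, "_" or "-"
        name = link.rstrip('/').rsplit('/', 1)[-1]
        name = re.sub(r'[^A-Za-z0-9_-]', '_', name) or "index"
        path = os.path.join(out_dir, name + ".html")
        with open(path, "w") as f:
            f.write(html)
        paths.append(path)
    return paths
```

You would collect (link, page text) pairs inside the loop above and pass them to this helper once at the end.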
The .topline > a CSS selector finds any tag with the topline class and matches the a tag directly beneath it.
Hope that helps.