Python newbie here. Python 2.7 with BeautifulSoup 3.2.1.
I'm trying to scrape a table from a simple page. I can easily get it to print, but I can't get my view function to return it.
The following works:
@app.route('/process')
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    print table
    return 'All good'
I can also return html successfully. But when I try to return table instead of return 'All good', I get the following error:
TypeError: ResultSet object is not an iterator
I also tried:
br.open(queryURL)
html = br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table")
out = []
for row in table.findAll('tr'):
    colvals = [col.text for col in row.findAll('td')]
    out.append('\t'.join(colvals))
return table
With no success. Any suggestions?
You're trying to return the tag object itself rather than its text, so return table.text should be what you are looking for. Full modified code:
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    return table.text
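As for the second attempt in the question: it builds out but still returns table, so the joined rows never reach the browser. Here is a minimal sketch of the joining step, assuming the cell text has already been pulled out into plain lists (rows_to_text and the sample data are made up for illustration and are not part of the original code):

```python
def rows_to_text(rows):
    """Join each row's cells with tabs, and the rows with newlines."""
    return '\n'.join('\t'.join(cells) for cells in rows)


# Hypothetical sample data standing in for the text of each <td> cell.
sample_rows = [['Some Band', 'Friday'], ['Another Band', 'Saturday']]

# A single string, which is the kind of value a view can return.
result = rows_to_text(sample_rows)
```

In the view itself, the equivalent would presumably be return '\n'.join(out) in place of return table, since the view needs to hand back a string (or a response object) rather than a parsed tag.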
EDIT:
Since I understand now that you want the HTML code that forms the site instead of the values, you can do something like this example I made:
import urllib

# Python 2: fetch the page and read it as a list of lines
url = urllib.urlopen('http://www.xpn.org/events/concert-calendar')
htmldata = url.readlines()
url.close()

# Print each line that contains a table-related opening tag, once per line
table_tags = ('<th', '<tr', '<thead', '<tbody', '<td')
for tag in htmldata:
    if any(t in tag for t in table_tags):
        print tag
As far as I know, BeautifulSoup isn't the natural fit for this, because it is geared more toward parsing HTML or printing it in a nice-looking manner. You can instead do what I did: loop through the lines of the HTML and print any line that contains one of the table tags.
If you want to store the output in a list to use later, you would do something like:
htmlCodeList = []
table_tags = ('<th', '<tr', '<thead', '<tbody', '<td')
for tag in htmldata:
    # Store each matching line once, even if it contains several of the tags
    if any(t in tag for t in table_tags):
        htmlCodeList.append(tag)
This saves each matching HTML line as a new element of the list, so the first matching line (for example, one containing <td>) would be at index 0, the next matching line at index 1, and so on.
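Since the original goal was to return the result from a view rather than print it, here is a rough sketch of the same line filter wrapped in a function and joined into one string. The names table_lines and TABLE_TAGS and the sample lines are hypothetical, chosen only for illustration:

```python
# Opening tags to look for, taken from the loop above.
TABLE_TAGS = ('<th', '<tr', '<thead', '<tbody', '<td')


def table_lines(lines):
    """Return the lines that contain at least one of the table tags."""
    return [line for line in lines if any(tag in line for tag in TABLE_TAGS)]


# Hypothetical sample standing in for the result of url.readlines().
sample = [
    '<html><body>\n',
    '<table>\n',
    '<tr><td>Some Band</td></tr>\n',
    '</table>\n',
    '</body></html>\n',
]

# One string that a view could return.
table_html = ''.join(table_lines(sample))
```

Note that this only keeps lines containing one of the listed opening tags, so in the sample the <table> and </table> lines are dropped, and it depends on how the page happens to be split into lines. If the page is already parsed with BeautifulSoup, converting the tag with str(table) should also give the table's markup as a string, which may be simpler.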