I have been working with the following bit of code, attempting to extract the text elements of this webpage.
site= 'http://football.fantasysports.yahoo.com/f1/1785/4/team?&week=4'
print site
response = urllib2.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
position = soup.find_all('span', class_="Fz-xxs")
for j in range(0,13):
    positionlist = str(position[j].get_text())
    print (positionlist)
Unfortunately, the text being put into the positionlist string contains many hyphens (e.g. SEA-RB), which I believe cannot be encoded. When I run the code as it is, I get the following error:
Traceback (most recent call last):
  File "/Users/masongardner/Desktop/TestSorter.py", line 20, in <module>
    positionlist = str(position[j].get_text())
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue002' in position 0: ordinal not in range(128)
I am aware that the hyphen cannot be encoded, but I am not sure how to change the code so that it either handles the hyphen as unicode, if possible, or ignores the hyphen and just encodes the text before and after it in each instance. This project is purely for my own use, so a hackerish approach is not a problem!
Thanks Everyone!
Don't cast to str; just print the unicode string you get from get_text:
import urllib2
from bs4 import BeautifulSoup

site = 'http://football.fantasysports.yahoo.com/f1/1785/4/team?&week=4'
print site
response = urllib2.urlopen(site)
html = response.read()
soup = BeautifulSoup(html)
position = soup.find_all('span', class_="Fz-xxs")
for j in range(0,13):
    positionlist = position[j].get_text()  # unicode string, no str() call
    print (positionlist)
Output:
Viewing Info for League: The League (ID# 1785)
<U+E002>  # a private-use character, see http://chars.suikawiki.org/char/E002
Since '08
Jax - QB
Atl - WR
Ten - WR
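You are seeing exactly what is in the source: <span class="F-icon Fz-xxs"></span></a>. The hyphens are plain ASCII and encode fine; the first matching span holds the icon-font character u'\ue002', and that is the character str() could not encode.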
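A quick way to check which character is actually at fault is to try the ASCII encoding by hand. This is only a sketch: the two sample strings are made up to mirror the page's text rather than fetched from it, and encodes_to_ascii is a hypothetical helper name.

```python
# Sample values standing in for the scraped text (assumed, not fetched from the page).
hyphen_text = u"Sea - RB"
icon_text = u"\ue002"  # private-use code point reported in the traceback


def encodes_to_ascii(text):
    """Return True if text can be encoded as ASCII, which is what str() attempts in Python 2."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(encodes_to_ascii(hyphen_text))  # the hyphen is plain ASCII
print(encodes_to_ascii(icon_text))    # U+E002 is outside the ASCII range
```

The first call should succeed and the second should not, which matches the traceback naming u'\ue002' rather than the hyphen.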
If you want to ignore that character, use if positionlist != u"\ue002": before printing.
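As a rough sketch of that check inside a loop: the texts list below is a hypothetical stand-in for the strings returned by get_text() on each span (values borrowed from the output above), and without_icon is a made-up helper name.

```python
ICON = u"\ue002"  # the private-use character seen in the traceback

# Hypothetical stand-in for [span.get_text() for span in position]
texts = [u"\ue002", u"Since '08", u"Jax - QB", u"Atl - WR"]


def without_icon(items, icon=ICON):
    """Return the items that are not just the icon character."""
    return [text for text in items if text != icon]


for text in without_icon(texts):
    print(text)
```

This only drops strings that consist of the icon character alone; a string that merely contains it would pass through unchanged.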
You can also use unicodedata:
import unicodedata
print unicodedata.normalize('NFKD', positionlist).encode('ascii','ignore')
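To see what that normalize-then-encode step keeps and drops, here is a small sketch. The helper name to_ascii and the sample strings are assumptions for illustration; only unicodedata.normalize and the 'ignore' error handler come from the answer above.

```python
import unicodedata


def to_ascii(text):
    """Decompose text (NFKD), then drop anything that still is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")


print(to_ascii(u"Jax - QB"))          # plain ASCII passes through
print(to_ascii(u"\ue002Jax - QB"))    # the icon character is silently dropped
print(to_ascii(u"Jos\u00e9"))         # the accent is split off and dropped, the base letter stays
```

Note that 'ignore' discards every character that has no ASCII form, not just the icon, so this suits throwaway scripts more than anything that must preserve names exactly. On Python 3 the result is a bytes object rather than a str.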