python python-2.7 unicode-string

How to escape a unicode error when converting a BeautifulSoup object to a string


I have been working with the following bit of code, attempting to extract the text elements of this webpage.

site= 'http://football.fantasysports.yahoo.com/f1/1785/4/team?&week=4'
print site
response = urllib2.urlopen(site)
html = response.read()

soup = BeautifulSoup(html)
position = soup.find_all('span', class_="Fz-xxs")
for j in range(0,13):
    positionlist = str(position[j].get_text())

print (positionlist)

Unfortunately, the text being put into the positionlist string contains many hyphens (e.g. SEA-RB) that apparently cannot be encoded. When I run the code as it is, I get the following error:

Traceback (most recent call last):
  File "/Users/masongardner/Desktop/TestSorter.py", line 20, in <module>
    positionlist = str(position[j].get_text())
UnicodeEncodeError: 'ascii' codec can't encode character u'\ue002' in position 0: ordinal not in range(128)

I am aware that the hyphen cannot be encoded, but I am not sure how to change the encoding so that the hyphen is handled as unicode, or otherwise how to ignore the hyphen and just encode the text before and after it in each instance. This project is purely for my own use, so a hackish approach is not a problem!

Thanks Everyone!


Solution

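  • Don't cast the result to str. In Python 2, str() on a unicode object implicitly encodes it as ASCII, which fails on any non-ASCII character. Just print the unicode string that get_text returns: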

    import urllib2
    from bs4 import BeautifulSoup

    site = 'http://football.fantasysports.yahoo.com/f1/1785/4/team?&week=4'

    print site
    response = urllib2.urlopen(site)
    html = response.read()

    soup = BeautifulSoup(html)
    position = soup.find_all('span', class_="Fz-xxs")
    for j in range(0, 13):
        positionlist = position[j].get_text()  # unicode string, no str() cast

        print positionlist

    Output:

    Viewing Info for League: The League (ID# 1785)
     # the private-use character U+E002 prints here: http://chars.suikawiki.org/char/E002
    
    
    
    
    Since '08
    Jax - QB
    
    Atl - WR
    
    Ten - WR
    

    You are seeing exactly what is in the page source: <span class="F-icon Fz-xxs">&#xe002;</span>. The character that fails to encode is U+E002 from this span, not the hyphen.
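
To see why the cast fails, here is a minimal sketch that reproduces the encode step without fetching the page. It assumes that the explicit .encode("ascii") call stands in for what Python 2's str() does implicitly on a unicode object; the variable names are made up for illustration, and the sample label is taken from the output above.

```python
import unicodedata

icon = u"\ue002"      # the character named in the traceback
label = u"Jax - QB"   # ordinary text taken from the printed output

# 'Co' is the Unicode general category for private-use code points.
category = unicodedata.category(icon)

# Python 2's str(some_unicode) performs this ASCII encode implicitly.
try:
    icon.encode("ascii")
    icon_is_ascii = True
except UnicodeEncodeError:
    icon_is_ascii = False

# The hyphen is plain ASCII, so this text encodes without trouble.
label_bytes = label.encode("ascii")
```

Here category comes out as 'Co' and icon_is_ascii as False, while the label with the hyphen encodes cleanly, so the hyphen is not the culprit.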

    If you want to ignore that character, guard the print with a check such as: if positionlist != u"\ue002":
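
As a rough sketch of that check applied to a whole batch of results, the helper below filters out entries that consist only of the icon character. The function name drop_icon_entries and the sample list are hypothetical; in the real script the list would come from calling get_text() on each span.

```python
ICON = u"\ue002"  # private-use character reported in the traceback


def drop_icon_entries(texts, icon=ICON):
    """Return only the texts that are not just the icon character."""
    return [t for t in texts if t != icon]


# Hypothetical sample standing in for the get_text() results.
sample = [u"\ue002", u"Since '08", u"Jax - QB", u"Atl - WR"]
kept = drop_icon_entries(sample)
```

This only removes entries that are exactly the icon; text that merely contains the character would pass through unchanged.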

    You can also use unicodedata to normalize the text and drop anything that cannot be represented in ASCII:

    import unicodedata
    print unicodedata.normalize('NFKD', positionlist).encode('ascii', 'ignore')
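
To make the effect of that one-liner concrete, here is a small sketch wrapped in a hypothetical helper named to_ascii. The sample strings are made up for illustration. Note that the result is a byte string (str in Python 2, bytes in Python 3), and that characters with no ASCII equivalent are silently discarded.

```python
import unicodedata


def to_ascii(text):
    """Decompose with NFKD, then drop every character that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore")


icon_only = to_ascii(u"\ue002")      # private-use character has no ASCII form
mixed = to_ascii(u"\ue002Jax - QB")  # icon is dropped, the rest is kept
accented = to_ascii(u"caf\xe9")      # e-acute decomposes to 'e' plus a combining accent
```

Because the icon character is simply dropped, this suits a quick personal script, but it loses information if the non-ASCII characters matter.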