python, utf-8, replace, ascii, strip

Python convert non standard characters


I have a list that I pulled from a webpage, and it contains some non-standard characters.

List example:

[<td class="td-number-nowidth"> 10 115 </td>, <td class="td-number-nowidth"> 4 635 (46%) </td>, <td class="td-number-nowidth"> 5 276 (52%) </td>, ...]

The A with the hat is supposed to be a comma. Can someone suggest how to convert or replace these so I can get at the value 10115 as in the first value in the list?

Source code:

from urllib import urlopen
from bs4 import BeautifulSoup
import re, string
content = urlopen('http://www.worldoftanks.com/community/accounts/1000395103-FrankenTank').read()
soup = BeautifulSoup(content)

BattleStats = soup.find_all('td', 'td-number-nowidth')
print BattleStats

Thanks, Frank


Solution

  • Does the website declare its encoding, for example in the charset parameter of its Content-Type header? (Content-Encoding describes compression such as gzip, not the character set.) You have to find that out and decode the byte strings using the .decode method, as in encoded_string.decode("encoding"). The encoding could be anything, utf-8 being one of the most common.
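A minimal sketch of that idea, written for Python 3 rather than the Python 2 used in the question. It assumes the page is UTF-8 and that the stray "Â" comes from a UTF-8 no-break space being decoded as Latin-1, which is a common cause of this symptom but is not confirmed by the question. The byte string, the header value, and the helper names charset_from_content_type and to_int are hypothetical, chosen only for illustration.

```python
import re
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def to_int(cell_text):
    """Parse the leading number of a cell such as ' 4\xa0635 (46%) '."""
    # Ignore a trailing parenthesised percentage, if there is one.
    head = cell_text.split("(", 1)[0]
    # Drop everything that is not a digit: spaces, no-break spaces, commas.
    digits = re.sub(r"\D", "", head)
    return int(digits)


# Hypothetical bytes standing in for the text of one downloaded table cell.
# \xc2\xa0 is a no-break space encoded as UTF-8.
raw_cell = b" 10\xc2\xa0115 "

# Decoding with the wrong codec produces the stray "A with a hat".
wrong = raw_cell.decode("latin-1")

# Decoding with the declared charset gives a clean no-break space instead.
encoding = charset_from_content_type("text/html; charset=utf-8")
right = raw_cell.decode(encoding)

value = to_int(right)
print(repr(wrong), repr(right), value)
```

With a live response object from urllib.request.urlopen in Python 3, the same charset lookup is available as response.headers.get_content_charset(). Stripping non-digits after decoding avoids having to know which separator character the site uses.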