XML and accented characters


I'm extracting data from this website. I use UTF-8 for my XML, the same charset as the website, so I don't really understand why the data aren't encoded correctly.

For example, from this page I'm getting garbled versions of names such as Astrit Ajdarević and Standard Liège, with the accented characters replaced by wrong ones, and so on...
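
Garbling of this kind typically appears when UTF-8 bytes are decoded with a single-byte charset somewhere in the pipeline. Here is a minimal Python sketch of that mechanism, assuming Windows-1252 is the wrongly applied charset (an assumption; nothing here says which charset Web-Harvest actually fell back to). The helper names are made up for illustration.

```python
def simulate_mojibake(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair_mojibake(garbled: str, wrong_charset: str = "cp1252") -> str:
    """Undo the mistake: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_charset).decode("utf-8")


for name in ("Astrit Ajdarević", "Standard Liège"):
    garbled = simulate_mojibake(name)
    # Show the original, the garbled form, and the repaired form side by side.
    print(f"{name} -> {garbled} -> {repair_mojibake(garbled)}")
```

Under that assumption, "Standard Liège" comes out as "Standard LiÃ¨ge". If the extracted data shows the same pattern, the page is most likely being read with the wrong charset rather than being wrong at the source.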

Details: extracting how?

Well, I'm using Web-Harvest, which transforms the HTML page into valid XML before parsing it.

So, for the example above, I use //div[2]/div[1]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td[2]/text() to get Astrit Ajdarević and //*[@id="site"]//div[contains(./div/h2, 'Spieler')]//tbody/tr[2]/td[position()=3] to get Standard Liège...

I hope this answers your questions :)


Solution:

<html-to-xml>
     <http url="${link}" charset="utf-8"/>
</html-to-xml>

Thanks to mactwixs <3


Solution

  • You probably need to set UTF-8 as the default charset in your Web-Harvest config file; otherwise it will not be used as the default. Also make sure you have the latest version of Web-Harvest (2.1).

    See the following:

    Manual - Config

    Manual - HTTP Config

    Similar Support Request

    The HTML that your browser resolves will also need:

    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    

    If none of that works, I suggest raising a support request on SourceForge.