xml encoding utf-8 xquery non-ascii-characters

XML and accented characters

I'm extracting data from this website. I use UTF-8 for my XML, the same charset as the website, so I don't understand why the data isn't encoded correctly.

For example, from this page I'm getting Astrit AjdareviÄ instead of Astrit Ajdarević, and Standard LiÃ¨ge instead of Standard Liège and so on...

Details: extracting how?

Well, I'm using WebHarvest, which transforms the HTML page into valid XML before parsing it.

So, for the example above, I use //div[2]/div[1]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td[2]/text() to get Astrit AjdareviÄ and //*[@id="site"]//div[contains(./div/h2, 'Spieler')]//tbody/tr[2]/td[position()=3] to get Standard LiÃ¨ge...

I hope this answers your questions :)

Solution:

<html-to-xml>
     <http url="${link}" charset="utf-8"/>
</html-to-xml>

Thanks to mactwixs <3

Solution

You probably need to set UTF-8 as the default charset in your Web-Harvest config file; otherwise it will not be used as the default. Also make sure you have the latest version of Web-Harvest (2.1).

See the following:

Manual - Config

Manual - HTTP Config

Similar Support Request

The HTML that your browser resolves will also need:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

If none of that works, I suggest raising a support request on SourceForge.
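
For what it's worth, the garbled strings in the question look like what you get when UTF-8 bytes are decoded as ISO-8859-1, which would be consistent with the charset not being set on the fetch. Below is a minimal Python sketch of that mechanism, not of anything Web-Harvest does internally; the function names `garble` and `repair` are made up for illustration, and ISO-8859-1 as the wrong charset is my assumption.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(text: str) -> str:
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return text.encode("iso-8859-1").decode("utf-8")


# "è" is the two UTF-8 bytes C3 A8, which ISO-8859-1 reads as "Ã" and "¨".
assert garble("Standard Liège") == "Standard LiÃ¨ge"

# The round trip restores the original text.
assert repair("Standard LiÃ¨ge") == "Standard Liège"
assert repair(garble("Astrit Ajdarević")) == "Astrit Ajdarević"
```

If your output matches what `garble` produces, the page bytes are fine and only the decoding step is wrong, so fixing the charset at the point where the page is fetched is the right place. `repair` is only a stopgap for data that has already been extracted.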