I'm extracting data from this website. I use UTF-8 for my XML, the same charset as the website, so I don't understand why the data isn't encoded correctly.
For example, from this page I'm getting Astrit Ajdarević
instead of Astrit Ajdarević
, and Standard Liège
instead of Standard Liège
and so on...
Details: extracting how?
Well, I'm using Web-Harvest, which transforms the HTML page into valid XML before parsing it.
So, for the example above, I use //div[2]/div[1]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td[2]/text()
to get Astrit Ajdarević
and //*[@id="site"]//div[contains(./div/h2, 'Spieler')]//tbody/tr[2]/td[position()=3]
to get Standard Liège
...
I hope this answers your questions :)
Solution:
<html-to-xml>
  <http url="${link}" charset="utf-8"/>
</html-to-xml>
Thanks to mactwixs <3
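For anyone wondering why the explicit charset matters, here is a minimal Python sketch of the symptom. It is not Web-Harvest code: it only assumes that the UTF-8 bytes of the page were decoded with a single-byte charset (Windows-1252 is my guess here), and the helper names misdecode and repair are made up for illustration.

```python
# Illustration only: shows how UTF-8 text gets garbled when its bytes are
# decoded with the wrong charset. The choice of cp1252 is an assumption;
# the actual wrong charset in the original setup is not known.

def misdecode(text: str, wrong_charset: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(garbled: str, wrong_charset: str = "cp1252") -> str:
    """Undo misdecode(): recover the raw bytes and decode them as UTF-8."""
    return garbled.encode(wrong_charset).decode("utf-8")


if __name__ == "__main__":
    for name in ("Astrit Ajdarević", "Standard Liège"):
        garbled = misdecode(name)
        # Each multi-byte UTF-8 character turns into two unrelated characters.
        print(repr(name), "->", repr(garbled), "->", repr(repair(garbled)))
```

Every non-ASCII character is stored as two or more bytes in UTF-8, so a single-byte decoder turns each one into several wrong characters. Telling the fetch step which charset to use, as in the solution above, avoids the wrong decode in the first place.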
You probably need to set UTF-8 as the default charset in your Web-Harvest config file; otherwise it will not be used by default. Also make sure you have the latest version of Web-Harvest (2.1).
See the following:
The HTML that your browser resolves will also need:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
If none of that works, I suggest raising a support request on SourceForge.