I am trying to extract £ signs from an XML feed (full version is here > http://mjsiphone.com/scotjobsnet/)
A mini version is here > http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml
I have no control over the source of the feed or the source server or how the feeds are formatted, the headers they use etc or so on.
I just have to build a .NET 4.5 console application that will extract the data and save it on our own sites database.
Also I have to strip all HTML (from job descriptions) and remove any HTML Encoded characters and replace them with their real values.
Therefore I need to save real £ signs in nvarchar datatypes in an MS SQL 2008 DB not £20,000 or £20,000 etc.
When viewing the source of the feed it has UTF-8 at the top of it.
However when viewing the feed in a browsers source I don't see any mention of UTF-8 as a Request/Response header and in the Request Headers (Chrome) I only see:
Accept-Language:en-GB,en-US;q=0.8,en;q=0.6
When I copy and paste the characters from the browser or the console into SQL and check them they return 163 which is the correct ASCII character encoding e.g £
If you view the feed in a browser the pound signs show up fine.
When I just pipe out the content to a Windows Command console they show up fine as £ signs.
However when I try to save them to the DB or pipe the console debug out to a file in EditPlus (with character encoding set to UTF8 or ASCII) I just get squares in front of the numbers instead of the signs e.g in CMD
[path to .exe] > [path to debug.log file]
Either the console cannot pipe across the content to the editor correctly or I need to use the right encoding or pass along more headers or extract the XML differently.
Here is an example of the code I am using for testing this with, using just one field that uses £ signs in it and then breaking.
static void Main(string[] args)
{
Console.WriteLine("START");
XmlDocument xDoc = new XmlDocument();
string feedURL = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";
WebClient webClient = new WebClient();
// need to pass a user-agent > 10 Chars to prevent blocking by OUR servers 403
webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");
// piping out to console with this line below shows a £ but to a UTF-8 or ASCII file it's gibberish
webClient.Headers.Add("Content-Type", "application/xml; charset=utf-8");
// I tried this but still the console works but piping to an editor in UTF-8 or ASCII shows squares
webClient.Headers.Add("Accept-Language", "utf-8,en-GB,en-US;q=0.8,en;q=0.6");
// download as text - is this the problem? Should I be using a different method
string feedText = webClient.DownloadString(feedURL);
// load into XML object
xDoc.LoadXml(feedText);
if (xDoc != null)
{
XmlElement root = xDoc.DocumentElement;
XmlNodeList xNodelst = root.SelectNodes("job");
foreach (XmlNode node in xNodelst)
{
string salary = node.SelectSingleNode("candidateSalary").InnerText;
// piped to cmd console the £ signs show but to a UTF-8 file they are just squares
// I've tried adding the Encoding.UTF8 or Encoding.ASCII still no joy
// Console.WriteLine("candidateSalary = " + salary,Encoding.UTF8);
Console.WriteLine("candidateSalary = " + salary);
break;
}
}
Console.WriteLine("FINISH");
}
Any help will be much appreciated.
I am sure it's just a header I need to pass or maybe an issue with outputting the XML content to the Editor.
As I said before just viewing the output in a Windows console the £ show up fine.
Thanks
I expect the output of this command is not UTF-8:
Console.WriteLine(Console.OutputEncoding);
There are two transcoding operations here:
UTF-8 > UTF-16 string > console encoding
The correct way to detect the XML document encoding is described in the XML specification. XmlDocument will do this for you.
The console encoding can be set to UTF-8 or you could serialize encoded bytes to STDOUT directly.
Console.OutputEncoding = System.Text.Encoding.UTF8;
XmlDocument xDoc = new XmlDocument();
string feedURL = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";
WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");
byte[] feed = client.DownloadData(feedURL);
xDoc.Load(new MemoryStream(feed));
if (xDoc != null)
{
XmlElement root = xDoc.DocumentElement;
XmlNodeList xNodelst = root.SelectNodes("job");
foreach (XmlNode node in xNodelst)
{
string salary = node.SelectSingleNode("candidateSalary").InnerText;
Console.WriteLine("candidateSalary = " + salary);
break;
}
}
More on cmd.exe and Unicode here.