I am processing XML files from a third party. These files occasionally contain invalid characters, which cause XmlTextReader.Read() to throw an exception.
I am currently handling this with the following function:
XmlTextReader GetCharSafeXMLTextReader(string fileName)
{
try
{
MemoryStream ms = new MemoryStream();
StreamReader sr = new StreamReader(fileName);
StreamWriter sw = new StreamWriter(ms);
string temp;
while ((temp = sr.ReadLine()) != null)
sw.WriteLine(temp.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), ""));
sw.Flush();
sr.Close();
ms.Seek(0, SeekOrigin.Begin);
return new XmlTextReader(ms);
}
catch (Exception exp)
{
// pass exp itself as the inner exception so the original stack trace is preserved
throw new Exception("Error parsing file: " + fileName + " " + exp.Message, exp);
}
}
My gut is saying there should be a better/faster way to do this. (And yes, getting the third party to fix their XMLs would be great, but it's not happening at this point.)
EDIT: Here is the final solution, based on cfeduke's answer:
public class SanitizedStreamReader : StreamReader
{
public SanitizedStreamReader(string filename) : base(filename) { }
/* other ctors as needed */
// this is the only one that XmlTextReader appears to use but
// it is unclear from the documentation which methods call each other
// so best bet is to override all of the Read* methods and Peek
public override string ReadLine()
{
return Sanitize(base.ReadLine());
}
public override int Read()
{
int temp = base.Read();
while (temp == 0x4 || temp == 0x14)
temp = base.Read();
return temp;
}
public override int Peek()
{
int temp = base.Peek();
while (temp == 0x4 || temp == 0x14)
{
temp = base.Read();
temp = base.Peek();
}
return temp;
}
public override int Read(char[] buffer, int index, int count)
{
int read;
int kept;
do
{
read = base.Read(buffer, index, count);
kept = 0;
// compact only the characters just read, dropping the invalid ones
for (int x = index; x < index + read; x++)
{
if (buffer[x] != 0x4 && buffer[x] != 0x14)
buffer[index + kept++] = buffer[x];
}
} while (read > 0 && kept == 0); // a chunk of only invalid characters must not look like end of stream
return kept;
}
private static string Sanitize(string unclean)
{
if (unclean == null)
return null;
if (String.IsNullOrEmpty(unclean))
return "";
return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
}
}
Sanitizing data is important, and edge cases such as invalid characters in "XML" do occur. Your solution is correct. If you want one that fits the .NET streaming model, restructure your code into its own StreamReader subclass:
public class SanitizedStreamReader : StreamReader {
public SanitizedStreamReader(string filename) : base(filename) { }
/* other ctors as needed */
// it is unclear from the documentation which methods call each other
// so best bet is to override all of the Read* methods and Peek
public override string ReadLine() {
return Sanitize(base.ReadLine());
}
// TODO override Read* and Peek with logic similar to this.ReadLine();
// remember that Read(Char[], Int32, Int32) must reduce its return value by
// the number of removed characters
private static string Sanitize(string unclean) {
if (String.IsNullOrEmpty(unclean))
return unclean; // keep null as null so ReadLine() still signals end of stream
return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
}
}
With this new SanitizedStreamReader you'll be able to chain it into processing streams as necessary, rather than relying on a magic method to clean things and present you with an XmlTextReader:
return new XmlTextReader(new SanitizedStreamReader("filename.xml"));
Admittedly this may be more work than necessary but you will gain flexibility from this approach.
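If the uncertainty about which Read* methods call each other is a concern, one possible variation is to derive from TextReader directly and wrap an inner reader. The base TextReader implementations of Read(char[], int, int) and ReadLine() are built on the single-character Read() and Peek(), so overriding those two should be enough. The sketch below is an illustration rather than part of the original answer: the names SanitizingTextReader, IsInvalid and Demo are hypothetical, and it assumes that 0x4 and 0x14 are the only characters that need removing.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical decorator: filters characters from any TextReader
// (StreamReader, StringReader, ...), not only from a file.
public class SanitizingTextReader : TextReader
{
    private readonly TextReader inner;

    public SanitizingTextReader(TextReader inner)
    {
        if (inner == null)
            throw new ArgumentNullException("inner");
        this.inner = inner;
    }

    // Assumption: only 0x4 and 0x14 ever appear as invalid characters.
    private static bool IsInvalid(int c)
    {
        return c == 0x4 || c == 0x14;
    }

    public override int Peek()
    {
        int c = inner.Peek();
        while (IsInvalid(c))
        {
            inner.Read();      // consume the invalid character
            c = inner.Peek();  // and look at the next one
        }
        return c;
    }

    public override int Read()
    {
        int c = inner.Read();
        while (IsInvalid(c))
            c = inner.Read();
        return c;              // -1 still signals end of input
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Demo
{
    public static void Main()
    {
        // In-memory input containing both offending characters.
        string dirty = "<root>a\u0004b\u0014c</root>";
        using (var reader = new XmlTextReader(
            new SanitizingTextReader(new StringReader(dirty))))
        {
            reader.MoveToContent();
            Console.WriteLine(reader.ReadElementContentAsString());
        }
    }
}
```

The trade-off is speed: every character goes through a virtual Read() call, so for large files the buffer-based override in a StreamReader subclass is likely to be faster.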