How to remove all instances of a character from a file in C#?


I am processing XML files from a third party. These files occasionally contain invalid characters, which cause XmlTextReader.Read() to throw an exception.

I am currently handling this with the following function:

XmlTextReader GetCharSafeXMLTextReader(string fileName)
{
    try
    {
        MemoryStream ms = new MemoryStream();
        StreamReader sr = new StreamReader(fileName);
        StreamWriter sw = new StreamWriter(ms);
        string temp;
        while ((temp = sr.ReadLine()) != null)
            sw.WriteLine(temp.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), ""));

        sw.Flush();
        sr.Close();
        ms.Seek(0, SeekOrigin.Begin);
        return new XmlTextReader(ms);
    }
    catch (Exception exp)
    {
        // pass the caught exception itself as the inner exception so its stack trace is kept
        throw new Exception("Error parsing file: " + fileName + " " + exp.Message, exp);
    }
}

My gut is saying there should be a better/faster way to do this. (And yes, getting the third party to fix their XMLs would be great, but it's not happening at this point.)

EDIT: Here is the final solution, based on cfeduke's answer:


    public class SanitizedStreamReader : StreamReader
    {
        public SanitizedStreamReader(string filename) : base(filename) { }
        /* other ctors as needed */
        // this is the only one that XmlTextReader appears to use but
        // it is unclear from the documentation which methods call each other
        // so best bet is to override all of the Read* methods and Peek
        public override string ReadLine()
        {
            return Sanitize(base.ReadLine());
        }

        public override int Read()
        {
            int temp = base.Read();
            while (temp == 0x4 || temp == 0x14)
                temp = base.Read();
            return temp;
        }

        public override int Peek()
        {
            int temp = base.Peek();
            while (temp == 0x4 || temp == 0x14)
            {
                base.Read(); // consume the invalid character
                temp = base.Peek();
            }
            return temp;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int read;
            int kept;
            do
            {
                read = base.Read(buffer, index, count);
                kept = 0;
                // compact the valid characters to the front of the range just read
                for (int x = index; x < index + read; x++)
                {
                    if (buffer[x] != 0x4 && buffer[x] != 0x14)
                        buffer[index + kept++] = buffer[x];
                }
                // read again if only invalid characters came back, so that
                // 0 is returned only at the real end of the stream
            } while (read > 0 && kept == 0);
            return kept;
        }

        private static string Sanitize(string unclean)
        {
            if (unclean == null)
                return null;
            if (String.IsNullOrEmpty(unclean))
                return "";
            return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
        }
    }

Solution

  • Sanitizing data is important. Edge cases such as invalid characters in "XML" do occur, and your solution is correct. If you want something that fits the .NET streaming model, restructure your code into its own StreamReader subclass:

    public class SanitizedStreamReader : StreamReader {
      public SanitizedStreamReader(string filename) : base(filename) { }
      /* other ctors as needed */
    
      // it is unclear from the documentation which methods call each other
      // so best bet is to override all of the Read* methods and Peek
      public override string ReadLine() {
        return Sanitize(base.ReadLine());
      }
    
      // TODO override Read*, Peek with logic similar to this.ReadLine()
      // remember Read(Char[], Int32, Int32) to modify the return value by
      // the number of removed characters
    
      private static string Sanitize(string unclean) {
        if (unclean == null)
          return null; // keep null so ReadLine() still signals end of stream
        return unclean.Replace(((char)4).ToString(), "").Replace(((char)0x14).ToString(), "");
      }
    }
    

    With this new SanitizedStreamReader you'll be able to chain it into processing streams as necessary, rather than relying on a magic method to clean things and present you with an XmlTextReader:

    return new XmlTextReader(new SanitizedStreamReader("filename.xml"));
    

    Admittedly, this may be more work than necessary, but you gain flexibility from this approach.
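
If overriding every Read* method feels fragile, one possible variation is to derive from TextReader and wrap another reader rather than subclassing StreamReader. The base TextReader implements ReadLine, ReadToEnd and Read(char[], int, int) on top of Read() and Peek(), so overriding those two should be enough. The sketch below is only an illustration: SanitizingTextReader, IsInvalid and Demo are hypothetical names that do not appear in the code above, and reading one character at a time is likely slower than a block read on large files.

```csharp
using System;
using System.IO;

// Hypothetical wrapper (not part of the original answer): filters characters
// 0x04 and 0x14 out of any TextReader it is given.
public class SanitizingTextReader : TextReader
{
    private readonly TextReader inner;

    public SanitizingTextReader(TextReader inner)
    {
        this.inner = inner;
    }

    // Same two characters the original code strips; -1 (end of input) is not invalid.
    private static bool IsInvalid(int c)
    {
        return c == 0x4 || c == 0x14;
    }

    public override int Peek()
    {
        int c = inner.Peek();
        while (IsInvalid(c))
        {
            inner.Read();      // consume the invalid character
            c = inner.Peek();  // look at the next one
        }
        return c;
    }

    public override int Read()
    {
        int c = inner.Read();
        while (IsInvalid(c))
            c = inner.Read();
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Demo
{
    public static void Main()
    {
        // "\u0004" and "\u0014" are the two characters being removed.
        string dirty = "<root>a\u0004b\u0014c</root>";
        using (var reader = new SanitizingTextReader(new StringReader(dirty)))
        {
            Console.WriteLine(reader.ReadToEnd()); // prints <root>abc</root>
        }
    }
}
```

For a file, the same wrapper could be passed to the XmlTextReader(TextReader) constructor, for example new XmlTextReader(new SanitizingTextReader(new StreamReader(fileName))).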