Tags: c#, utf-8, shift-jis

Convert a file from Shift-JIS to UTF-8 (no BOM) without re-reading from disk


I am dealing with files in several encodings, including Shift-JIS and UTF-8 without a BOM. Using a bit of language knowledge, I can detect whether a file is being interpreted correctly as UTF-8 or Shift-JIS. If I detect that the file is not in the encoding I read it with, is there a way to reinterpret my in-memory data without re-reading the file with a new encoding specified?

Right now, I read in the file assuming Shift-JIS as such:

using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
    string line = sr.ReadToEnd();

    // Detection must be done AFTER you read from the file.  Silly rabbit.
    // If a BOM was found, CurrentEncoding no longer equals the Shift-JIS
    // encoding passed in, so the format is known for certain.
    fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    codingFromBOM = sr.CurrentEncoding;
}

After that, I check whether the file is in a known format (it has a BOM) or whether the data makes sense as Shift-JIS; if so, all is well. If the data is garbage, though, I re-read the file via:

using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
    string line = sr.ReadToEnd();
}

I am trying to avoid this re-read step and reinterpret the data in memory if possible.

Or is magic already happening and I am needlessly worrying about double I/O access?


Solution

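var buf = File.ReadAllBytes(path);        // read the file from disk once
var text = Encoding.UTF8.GetString(buf);  // first attempt: decode as UTF-8
if (text.Contains("\uFFFD")) // Unicode replacement character: the bytes were not valid UTF-8
{
    // Reinterpret the same in-memory bytes as Shift-JIS (code page 932).
    text = Encoding.GetEncoding(932).GetString(buf);
}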
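
Two caveats with the U+FFFD check above: a file that legitimately contains U+FFFD would be misclassified, and Encoding.UTF8.GetString leaves a UTF-8 BOM in the decoded string. A minimal sketch of a stricter variant follows; it is not part of the original answer. It assumes a modern .NET runtime (.NET Core 3.0 or later), where code page 932 has to be registered through CodePagesEncodingProvider; on .NET Framework that line can most likely be dropped. The names TextDecoding and DecodeUtf8OrShiftJis are hypothetical.

```csharp
using System;
using System.Text;

static class TextDecoding
{
    // Hypothetical helper: decode a buffer as strict UTF-8 and fall back to
    // Shift-JIS (code page 932) when the bytes are not valid UTF-8.
    public static string DecodeUtf8OrShiftJis(byte[] buf)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 932
        // is only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // throwOnInvalidBytes: true makes GetString throw on malformed input
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            // Skip a UTF-8 BOM (EF BB BF) if present so it does not end up
            // as U+FEFF at the start of the string.
            int offset = (buf.Length >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
                ? 3
                : 0;
            return strictUtf8.GetString(buf, offset, buf.Length - offset);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: reinterpret the same in-memory bytes as Shift-JIS.
            return Encoding.GetEncoding(932).GetString(buf);
        }
    }
}
```

Usage would look like var text = TextDecoding.DecodeUtf8OrShiftJis(File.ReadAllBytes(path)); so the file is still read from disk only once. Passing this check only means the bytes are well-formed UTF-8, so the language-based sanity check from the question remains useful for short or ASCII-heavy files.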