
C#: getting and changing the file encoding


I'm a little bit confused about the file encoding. I want to change it. Here is my code:

using System;
using System.IO;
using System.Text;

public class ChangeFileEncoding
    {
        private const int BUFFER_SIZE = 15000;

        public static void ChangeEncoding(string source, Encoding destinationEncoding)
        {
            var currentEncoding = GetFileEncoding(source);
            // Write to a temporary file next to the source, then swap it in below.
            string destination = Path.Combine(Path.GetDirectoryName(source), Guid.NewGuid().ToString() + Path.GetExtension(source));
            using (var reader = new StreamReader(source, currentEncoding))
            {
                using (var writer = new StreamWriter(File.OpenWrite(destination), destinationEncoding))
                {
                    char[] buffer = new char[BUFFER_SIZE];
                    int charsRead;
                    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        writer.Write(buffer, 0, charsRead);                        
                    }
                }
            }
            File.Delete(source);
            File.Move(destination, source);
        }

        public static Encoding GetFileEncoding(string srcFile)
        {
            using (var reader = new StreamReader(srcFile))
            {
                reader.Peek();
                return reader.CurrentEncoding;
            }
        }
    }

And in the Program.cs I have the code:

    string file = @"D:\path\test.txt";
    Console.WriteLine(ChangeFileEncoding.GetFileEncoding(file).EncodingName);
    ChangeFileEncoding.ChangeEncoding(file, new System.Text.ASCIIEncoding());
    Console.WriteLine(ChangeFileEncoding.GetFileEncoding(file).EncodingName);

And the text printed in my console is:

Unicode (UTF-8)

Unicode (UTF-8)

Why hasn't the file's encoding changed? Am I changing the encoding incorrectly?

Regards


Solution

  • The StreamReader class, when not passed an Encoding in its constructor, will try to automatically detect the encoding of a file. It will do so just fine when the file starts with a BOM (and you should write the preamble when changing the encoding of a file to facilitate this the next time you want to read the file).

    Properly detecting the encoding of a text file is a Hard Problem, especially for non-Unicode files or Unicode files without a BOM. The reader (whether StreamReader, Notepad++ or any other reader) will have to guess which encoding is being used in the file.
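
    As a rough illustration of the BOM point, the sketch below writes the same text with and without a preamble and asks StreamReader what it detected. The class name BomProbe and its methods are hypothetical helpers, not part of the question's code, and the temp-file usage in the note is only an assumption for the demo.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    // Hypothetical helper for illustration only.
    public static class BomProbe
    {
        // True when the encoding emits a byte order mark (preamble) at the start of a file.
        public static bool WritesPreamble(Encoding encoding)
        {
            return encoding.GetPreamble().Length > 0;
        }

        // Lets StreamReader sniff the BOM, the same way the question's GetFileEncoding does.
        public static string DetectedName(string path)
        {
            // With no encoding argument, StreamReader starts from UTF-8 and
            // switches only if it recognises a BOM in the first bytes.
            using (var reader = new StreamReader(path))
            {
                reader.Peek(); // forces the reader to inspect the start of the file
                return reader.CurrentEncoding.WebName;
            }
        }
    }
    ```

    For example, after File.WriteAllText(path, "hello", new ASCIIEncoding()) the probe reports "utf-8", because ASCII has no preamble and the reader keeps its default. After File.WriteAllText(path, "hello", new UnicodeEncoding(false, true)) the file starts with FF FE and the probe reports "utf-16".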

    See also "How can I detect the encoding/codepage of a text file", which says:

    "You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results."

    Because ASCII (characters 0-127) is a subset of Unicode, and UTF-8 encodes those characters as the same single bytes, it's safe to read an ASCII file as UTF-8. An ASCII file has no BOM, so the StreamReader falls back to its default, UTF-8, and reports that encoding.

    That is, as long as it's truly ASCII. Any byte above 127 means the file uses some other encoding, often an "ANSI" code page, and then you're into the fun of guessing the correct code page.
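
    A minimal sketch of what such guessing can look like, assuming you already have the file's bytes (for example from File.ReadAllBytes). The class EncodingGuess and its method names are made up for this example. Passing these checks only shows that the bytes are consistent with an encoding, not that the file was actually written in it.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    // Hypothetical helper for illustration only.
    public static class EncodingGuess
    {
        // True when every byte is in the 7-bit ASCII range (0-127).
        public static bool IsPureAscii(byte[] bytes)
        {
            return bytes.All(b => b < 128);
        }

        // True when the bytes decode as UTF-8 without any invalid sequence.
        public static bool IsValidUtf8(byte[] bytes)
        {
            // throwOnInvalidBytes: true makes the decoder throw instead of
            // silently substituting replacement characters.
            var strict = new UTF8Encoding(false, true);
            try
            {
                strict.GetString(bytes);
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }
    ```

    The bytes 68 E9 6C 6C 6F ("héllo" in a Latin-1 style code page) fail both checks, so the file is neither ASCII nor UTF-8 and the code page still has to be guessed or supplied by whoever produced the file.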

    So to answer your question: you have changed the file's encoding. There simply is no fool-proof way to "detect" it; you can merely guess it.

    Required reading material: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" and "Unicode, UTF, ASCII, ANSI format differences".