c# text utf-8 character-encoding globalization

Read text file with (multiple / mixed / more than one) encoding


I have a text file with more than one encoding where the encoding to use is itself specified in the text file (the vCard format is an example which allows this). Here's an example:

charset=windows-1251: ABCDE
charset=utf-8: VWXYZ

...where "ABCDE" will be interpreted as encoding "windows-1251" and "VWXYZ" will be in UTF-8. Ultimately, I want it all converted to a standard string (which is UTF-16 in C#).

I think I want to use ReadAllText() because it apparently detects the encoding automatically when none is otherwise specified. When a charset is specified as above, that would override the default encoding.

Unfortunately, I'd also need to do some text parsing to look for the various encodings, so I think ReadAllBytes() would be needed, so I can parse character by character in a more raw format.

I want it to be fast too. What's the best way of dealing with this?


Solution

  • Assuming all the metadata about the encoding is going to be in ASCII, you could decode it with some lenient single-byte-based encoding, which would allow you to parse the text as usual. Then reparse (from bytes) each string with an appropriate encoding.

    Some silly example code:

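    // Read the file once as raw bytes, then decode those same bytes with a
    // single-byte encoding. (File.ReadAllText would honor a byte order mark
    // and could shift the character positions relative to the bytes.)
    byte[] asBytes = System.IO.File.ReadAllBytes("C:/Temp/test.txt");
    var singleByte = Encoding.GetEncoding("Windows-1252");
    string asString = singleByte.GetString(asBytes);

    // ParseFile is a placeholder for your own parser; each entry reports
    // where its text starts, how long it is, and which charset it declares.
    foreach (var entry in ParseFile(asString))
    {
        int start = entry.PositionInString;
        // Since we used a one-byte encoding, we can use this location
        // directly in the byte array.

        int length = entry.Length;
        string encodingName = entry.Encoding;
        string decodedEntry = Encoding.GetEncoding(encodingName)
                                      .GetString(asBytes, start, length);
        Console.WriteLine(decodedEntry);
    }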
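
For a more complete picture, here is a minimal, self-contained sketch of the same idea applied to the `charset=NAME: VALUE` line format from the question. The class name `MixedEncodingReader` and the method `DecodeLines` are hypothetical names chosen for illustration, and the sketch assumes that the metadata is plain ASCII, that every declared charset is ASCII-compatible (so a `\n` byte always means a line break), and that lines without a `charset=` prefix should use a caller-supplied fallback encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MixedEncodingReader
{
    // Hypothetical helper: decodes lines of the form "charset=NAME: VALUE",
    // where VALUE is stored in the encoding called NAME. Lines without that
    // prefix are decoded whole with the fallback encoding.
    public static List<string> DecodeLines(byte[] bytes, Encoding fallback)
    {
        // ISO-8859-1 maps every byte to exactly one char, so an index into
        // `raw` is also a valid offset into `bytes`.
        string raw = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        const string prefix = "charset=";
        var result = new List<string>();

        int pos = 0;
        while (pos < raw.Length)
        {
            // Find the end of the current line, excluding "\n" or "\r\n".
            int end = raw.IndexOf('\n', pos);
            if (end < 0) end = raw.Length;
            int lineEnd = end;
            if (lineEnd > pos && raw[lineEnd - 1] == '\r') lineEnd--;

            Encoding enc = fallback;
            int valueStart = pos;

            if (string.CompareOrdinal(raw, pos, prefix, 0, prefix.Length) == 0)
            {
                int sep = raw.IndexOf(": ", pos, lineEnd - pos, StringComparison.Ordinal);
                if (sep >= 0)
                {
                    string name = raw.Substring(pos + prefix.Length,
                                                sep - pos - prefix.Length);
                    // Throws ArgumentException if the name is not recognized.
                    enc = Encoding.GetEncoding(name);
                    valueStart = sep + 2;
                }
            }

            // Decode the value straight from the original bytes.
            result.Add(enc.GetString(bytes, valueStart, lineEnd - valueStart));
            pos = end + 1;
        }
        return result;
    }
}
```

On .NET Core and .NET 5 or later, code pages such as windows-1251 are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` once at startup; on .NET Framework they are available by default. A call would look like `MixedEncodingReader.DecodeLines(File.ReadAllBytes(path), Encoding.UTF8)`. Because the file is read once and each value is decoded exactly once from the byte array, this should also be reasonably fast, though it has not been benchmarked here.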