c# windows-10 uwp windows-10-mobile windows-10-universal

Read a Unicode string from a text file in a UWP app


In a Windows 10 app I am trying to read a string from a .txt file and set it as the text of a RichEditBox:

Code variant 1:

var read = await FileIO.ReadTextAsync(file, Windows.Storage.Streams.UnicodeEncoding.Utf8);
txt.Document.SetText(Windows.UI.Text.TextSetOptions.None, read);

Code variant 2:

var stream = await file.OpenAsync(Windows.Storage.FileAccessMode.Read);
ulong size = stream.Size;
using (var inputStream = stream.GetInputStreamAt(0))
{
    using (var dataReader = new Windows.Storage.Streams.DataReader(inputStream))
    {
        dataReader.UnicodeEncoding = Windows.Storage.Streams.UnicodeEncoding.Utf8;
        uint numBytesLoaded = await dataReader.LoadAsync((uint)size);
        string text = dataReader.ReadString(numBytesLoaded);
        txt.Document.SetText(Windows.UI.Text.TextSetOptions.FormatRtf, text);
    }
}

With some files I get this error: "No mapping for the Unicode character exists in the target multi-byte code page".

I found one solution:

IBuffer buffer = await FileIO.ReadBufferAsync(file);
DataReader reader = DataReader.FromBuffer(buffer);
byte[] fileContent = new byte[reader.UnconsumedBufferLength];
reader.ReadBytes(fileContent);
string text = Encoding.UTF8.GetString(fileContent, 0, fileContent.Length);
txt.Document.SetText(Windows.UI.Text.TextSetOptions.None, text);

But with this code the text is displayed as question marks inside diamonds (the U+FFFD replacement character).

How can I read and display such text files with the correct encoding?


Solution

    1) I made a port of the Mozilla Universal Charset Detector to UWP (published on NuGet) and feed it the raw bytes of the file:

    ICharsetDetector cdet = new CharsetDetector();
    cdet.Feed(fileContent, 0, fileContent.Length);
    cdet.DataEnd();
    

    2) Decode the bytes with the detected charset using the NuGet library Portable.Text.Encoding:

    // text stays null if the detector could not determine a charset
    string text = null;
    if (cdet.Charset != null)
        text = Portable.Text.Encoding.GetEncoding(cdet.Charset).GetString(fileContent, 0, fileContent.Length);
    

    That's all. Now both Unicode files and files in legacy code pages (including cp1251 and cp1252) are displayed correctly.
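
As a complement to the steps above, here is a minimal sketch of a check you could run before charset detection: try to decode the bytes as strict UTF-8, and only fall back to the detector when that fails. The class and method names (TextDecoding, TryDecodeUtf8) are hypothetical and not part of the original answer; the sketch uses only System.Text and assumes the file bytes are already in a byte array such as fileContent.

```csharp
using System;
using System.Text;

public static class TextDecoding
{
    // Hypothetical helper: returns true and the decoded text when the bytes
    // are valid UTF-8, and false when they contain invalid sequences.
    public static bool TryDecodeUtf8(byte[] bytes, out string text)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD (the "question mark in a diamond").
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        // Skip a UTF-8 byte order mark (EF BB BF) if the file starts with one.
        int offset = (bytes.Length >= 3 &&
                      bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) ? 3 : 0;

        try
        {
            text = strictUtf8.GetString(bytes, offset, bytes.Length - offset);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: the caller should detect the charset instead.
            text = null;
            return false;
        }
    }
}
```

If TryDecodeUtf8 returns false, the file is probably in a legacy code page, and the detection and decoding from steps 1 and 2 apply. This also explains the replacement characters in the question: Encoding.UTF8 does not throw on invalid bytes by default, it replaces them with U+FFFD.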