Matlab misreading ASCII text file


I am analyzing some text files in Matlab, and it is screwing up some of the text when it reads them. I am using R2017a (9.2.0.538062) 64-bit (maci64). Please note the accented characters in the sample below.

Text editors (TextMate, Emacs, TextEdit) read the file ("War and Peace.txt") correctly, as do GNU Octave and other programs (Python, Ruby, Mathematica):

It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna.

Whereas in Matlab

It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna.

My Question

Is there a Matlab setting (in Preferences?) that will make it read ASCII text accurately? Matlab appears to be garbling valid ASCII characters (mostly in the 200-256 range).


Solution

  • I faced the same problem when trying to read strings from a text file. In my case, the cause was that I had saved the .txt file in ANSI encoding. After many trials, I came up with a solution. First, save the file in UTF-8 encoding.

    Then, in your MATLAB code, specify the encodingIn argument in the fopen command.

    A test code can be something like:

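    close all; clearvars; clc;
    
    % Open for reading, native byte order ('n'), decoding the bytes as UTF-8.
    fileID = fopen('text.txt', 'r', 'n', 'UTF-8');
    C = textscan(fileID, '%s');   % split the text into whitespace-delimited words
    fclose(fileID);
    
    celldisp(C)   % print each word stored in the cell array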
    

    The output of this code would be:

    C{1}{1} =
    
    It
    
    
    C{1}{2} =
    
    was
    
    
    C{1}{3} =
    
    in
    
    
    C{1}{4} =
    
    July,
    
    
    C{1}{5} =
    
    1805,
    
    
    C{1}{6} =
    
    and
    
    
    C{1}{7} =
    
    the
    
    
    C{1}{8} =
    
    speaker
    
    
    C{1}{9} =
    
    was
    
    
    C{1}{10} =
    
    the
    
    
    C{1}{11} =
    
    well-known
    
    
    C{1}{12} =
    
    Anna
    
    
    C{1}{13} =
    
    Pávlovna
    
    
    C{1}{14} =
    
    Schérer,
    
    
    C{1}{15} =
    
    maid
    
    
    C{1}{16} =
    
    of
    
    
    C{1}{17} =
    
    honor
    
    
    C{1}{18} =
    
    and
    
    
    C{1}{19} =
    
    favorite
    
    
    C{1}{20} =
    
    of
    
    
    C{1}{21} =
    
    the
    
    
    C{1}{22} =
    
    Empress
    
    
    C{1}{23} =
    
    Márya
    
    
    C{1}{24} =
    
    Fëdorovna.
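
To see why the encoding argument matters, here is a minimal sketch that reproduces the kind of garbling described in the question. It assumes the file is stored as UTF-8 and that the wrong decoder is a single-byte encoding; ISO-8859-1 is only an example, and the default MATLAB actually uses depends on the platform and release. The variable names and the temporary file are made up for illustration.

```matlab
% Build 'Pávlovna' from code points so the script itself is encoding-safe.
word = char([80 225 118 108 111 118 110 97]);

% These are the bytes a UTF-8 file would contain for that word.
utf8Bytes = unicode2native(word, 'UTF-8');

% Decoding with a single-byte encoding (assumed here: ISO-8859-1) turns
% every byte into one character, so the two-byte accented letter splits.
garbled = native2unicode(utf8Bytes, 'ISO-8859-1');

% Decoding with the matching encoding recovers the original text.
correct = native2unicode(utf8Bytes, 'UTF-8');

% Same idea through a file: write the raw bytes, then read them back
% with the encoding stated explicitly in fopen.
tmp = [tempname '.txt'];            % hypothetical scratch file
fid = fopen(tmp, 'w');
fwrite(fid, utf8Bytes, 'uint8');
fclose(fid);

fid = fopen(tmp, 'r', 'n', 'UTF-8');
fromFile = fgetl(fid);                    % text decoded as UTF-8
[~, ~, ~, usedEncoding] = fopen(fid);     % encoding MATLAB applied to this file
fclose(fid);
delete(tmp);

fprintf('%d characters stored as %d bytes\n', numel(word), numel(utf8Bytes));
fprintf('Decoded with the wrong encoding: %d characters\n', numel(garbled));
fprintf('File opened with encoding: %s\n', usedEncoding);
```

Calling fopen with four outputs on a file opened without an explicit encoding is one way to check which encoding MATLAB picked by default, which may help confirm whether a mismatch like this is the cause.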