
I can't import an ISO-8859-1 text file in SAS (UTF-8 session)


I have the file prova.txt:

01                                                                                                                                                                                                                                                       ±00CC00   2023-07-21
50                                               MSTAT»BR_02»BR_07»BR_14                                                                                                                                                                                 ±000066   2023-07-19
01                                                                                                                                                                                                                                                       ±00CC00   2023-07-21

which has this encoding (file -bi prova.txt):

text/plain; charset=iso-8859-1

I'm trying to import it into SAS with this program:

libname pathdata "/my/dir/dataset";

filename inp "/my/dir/file/prova.txt";
    
data pathdata.prova;
  Infile inp /*encoding="wlatin1"*/ lrecl=270 DSD MISSOVER PAD firstObs=1;
  Attrib colore length=$49 format=$char49. informat=$char49.;
  Attrib orig length=$200 format=$char200. informat=$char200.;
  Attrib app length=$10 format=$char10. informat=$char10.;
  Attrib data_v length=$10 format=$char10. informat=$char10.;

  Input
        @1 colore $char49.
        @50 orig $char200.
        @250 app $char10.
        @260 data_v $char10.
;

run;

If I don't use encoding="wlatin1", I get the wrong characters in the SAS dataset.

If I use encoding="wlatin1", I get the correct characters, but the variables that follow are shifted.

The session encoding is ENCODING=UTF-8.


Solution

  • Read it with ENCODING=ANY and then transcode the strings yourself.

    Make sure to define the variables long enough to hold the UTF-8 version of the text, that is, longer than the number of bytes read from the file.

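    data test;
      /* ENCODING="ANY" reads the raw bytes without transcoding, */
      /* so the fixed column positions stay intact.              */
      infile inp encoding="any" truncover;
      /* Lengths are larger than the input widths to leave room for UTF-8. */
      length colore $60 orig $250 app $15 data_v $15 ;
      input colore $char49. orig $char200. app $char10. data_v $char10. ;
      /* Transcode every character variable from WLATIN1 to UTF-8. */
      array _c _character_;
      do over _c;
        _c=kcvt(_c,'wlatin1','utf-8');
      end;
    run;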
    

    Or read the file using WLATIN1 encoding, but pull the strings from the _INFILE_ variable using KSUBSTR() instead of the INPUT statement.

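    data test;
      /* The line is transcoded to UTF-8 as it is read. */
      infile inp encoding="wlatin1" truncover;
      length colore $60 orig $250 app $15 data_v $15 ;
      /* Load the line into _INFILE_ without reading any variables. */
      input ;
      /* KSUBSTR() counts characters rather than bytes. */
      colore = ksubstr(_infile_,1,49);
      orig   = ksubstr(_infile_,50,200);
      app    = ksubstr(_infile_,250,10);
      data_v = ksubstr(_infile_,260,10);
    run;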
    


    The reason you are having trouble with ENCODING="WLATIN1" in a SAS session that uses UTF-8 encoding is that the lines are transcoded as they are read. So the positions of the APP and DATA_V fields on the line move when the non-ASCII characters are transcoded from single-byte to multi-byte.
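
    To see the size change in isolation, here is a minimal sketch (not part of the original program) that converts the single WLATIN1 byte for ± to UTF-8 and prints the byte length before and after. The variable names are made up for illustration.

    ```sas
    data _null_;
      length latin1 $1 utf8 $4;
      latin1 = 'B1'x;                             /* ± as a single WLATIN1 byte */
      utf8   = kcvt(latin1, 'wlatin1', 'utf-8');  /* the same character in UTF-8 */
      bytes_before = length(latin1);              /* LENGTH() counts bytes */
      bytes_after  = length(utf8);
      put bytes_before= bytes_after=;
    run;
    ```

    Since ± is U+00B1, which UTF-8 encodes as two bytes, the log should report one byte before and two after. In the sample lines the ± sits at column 250, so everything after it ends up one byte further right. The three » characters in the second line should likewise add one byte each in front of the APP field.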

    If you were not reading by column position, but instead had a delimited file, such as a CSV file, you would not have had this trouble.
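
    A minimal sketch of that delimited case, assuming a hypothetical comma-separated version of the same data. The path prova.csv, the fileref csv and the dataset name test_csv are invented for illustration.

    ```sas
    filename csv "/my/dir/file/prova.csv";  /* hypothetical CSV version of the file */

    data test_csv;
      infile csv encoding="wlatin1" dsd dlm=',' truncover;
      /* Lengths still leave room for the multi-byte UTF-8 characters. */
      length colore $60 orig $250 app $15 data_v $15 ;
      /* List input splits fields at the delimiter, not at byte columns. */
      input colore orig app data_v ;
    run;
    ```

    Because the fields are located by the delimiter, it does not matter how many bytes each character takes up after transcoding.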