I can't import an ISO-8859-1 text file into SAS (UTF-8 session)

I have the file prova.txt:

01                                                                                                                                                                                                                                                       ±00CC00   2023-07-21
50                                               MSTAT»BR_02»BR_07»BR_14                                                                                                                                                                                 ±000066   2023-07-19
01                                                                                                                                                                                                                                                       ±00CC00   2023-07-21

which has this encoding (file -bi prova.txt):

text/plain; charset=iso-8859-1

I'm trying to import it into SAS with this program:

libname pathdata "/my/dir/dataset";

filename inp "/my/dir/file/prova.txt";
    
data pathdata.prova;
Infile inp /*encoding="wlatin1"*/ lrecl=270 DSD MISSOVER PAD firstObs=1;                                                                                                                                                                                                         
Attrib colore length=$49                                                                                                                                                                                                                            
format=$char49. informat=$char49. ;                                                                                                                                                                               
Attrib orig length=$200                                                                                                                                                                                                                         
format=$char200. informat=$char200. ;                                                                                                                                                                        
Attrib app length=$10                                                                                                                                                                                                                        
format=$char10. informat=$char10. ;                                                                                                                                      
Attrib data_v length=$10                                                                                                                                                                                                                             
format=$char10. informat=$char10.;                                                                                                                                                                                                    

  Input
        @1 colore $char49.
        @50 orig $char200.
        @250 app $char10.
        @260 data_v $char10.
;

run;

If I don't use encoding="wlatin1", I get wrong characters in the SAS dataset.

If I use encoding="wlatin1", I get the correct characters, but the variables that follow are shifted.

The session encoding is ENCODING=UTF-8.

Solution

Read it with ENCODING=ANY and then transcode the strings yourself.

Make sure to define the variables long enough to hold the UTF-8 version of the text, that is, longer than the number of bytes read from the file, since each non-ASCII character takes more than one byte in UTF-8.

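data test;
  /* ENCODING="ANY" means no transcoding on read, so byte positions match the file */
  infile inp encoding="any" truncover;
  /* lengths leave room for the longer UTF-8 values */
  length colore $60 orig $250 app $15 data_v $15 ;
  input colore $char49. orig $char200. app $char10. data_v $char10. ;
  /* transcode every character variable from WLATIN1 to UTF-8 */
  array _c _character_;
  do over _c;
    _c=kcvt(_c,'wlatin1','utf-8');
  end;
run;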

Or read the file using WLATIN1 encoding, but pull the strings from the _INFILE_ variable using KSUBSTR() instead of the INPUT statement.

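data test;
  /* the line is transcoded to UTF-8 as it is read */
  infile inp encoding="wlatin1" truncover;
  length colore $60 orig $250 app $15 data_v $15 ;
  /* load the line into _INFILE_ without reading any variables */
  input ;
  /* KSUBSTR() counts characters, not bytes, so the columns still line up */
  colore =ksubstr(_infile_,1,49);
  orig =ksubstr(_infile_,50,200);
  app =ksubstr(_infile_,250,10);
  data_v = ksubstr(_infile_,260,10);
run;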

The reason ENCODING="WLATIN1" causes trouble in a SAS session that uses UTF-8 encoding is that each line is transcoded as it is read. The byte positions of the APP and DATA_V fields therefore move whenever non-ASCII characters earlier on the line are transcoded from single-byte to multi-byte.
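
The size change can be checked directly in the UTF-8 session. The step below is only a minimal sketch: it assumes that byte B1 is the WLATIN1 code for ± (the character in the sample file), and the variable names are made up for illustration.

```sas
data _null_;
  length raw $7 utf8 $20;
  raw  = 'B1'x || '00CC00';              /* the 7 bytes as stored in the file   */
  utf8 = kcvt(raw, 'wlatin1', 'utf-8');  /* same text after transcoding         */
  raw_bytes  = length(raw);              /* LENGTH() counts bytes               */
  utf8_bytes = length(utf8);             /* expected to be larger than raw_bytes */
  utf8_chars = klength(utf8);            /* KLENGTH() counts characters         */
  put raw_bytes= utf8_bytes= utf8_chars=;
run;
```

If the transcoded value reports more bytes than characters, every column after that character sits further right in the transcoded line than it does in the file, which is the shift seen in APP and DATA_V.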

If you were not reading by column position but from a delimited file, such as a CSV file, you would not have had this trouble.
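
As a rough sketch of that delimited case, assuming a hypothetical pipe-delimited version of the same data (the file name prova_delim.txt and the | delimiter are not from the question):

```sas
/* hypothetical delimited copy of the data; path and delimiter are assumptions */
filename inpd "/my/dir/file/prova_delim.txt";

data test_dlm;
  /* transcoding on read is safe here: fields are found by delimiter, not by byte position */
  infile inpd encoding="wlatin1" dlm='|' dsd truncover;
  length colore $60 orig $250 app $15 data_v $15;
  input colore orig app data_v;
run;
```

The lengths are still set larger than the raw field widths so that the UTF-8 values fit.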