I have the file prova.txt:
01 ±00CC00 2023-07-21
50 MSTAT»BR_02»BR_07»BR_14 ±000066 2023-07-19
01 ±00CC00 2023-07-21
which has this encoding (file -bi prova.txt):
text/plain; charset=iso-8859-1
I'm trying to import it into SAS with this program:
libname pathdata "/my/dir/dataset";
filename inp "/my/dir/file/prova.txt";
data pathdata.prova;
Infile inp /*encoding="wlatin1"*/ lrecl=270 DSD MISSOVER PAD firstObs=1;
Attrib colore length=$49
format=$char49. informat=$char49. ;
Attrib orig length=$200
format=$char200. informat=$char200. ;
Attrib app length=$10
format=$char10. informat=$char10. ;
Attrib data_v length=$10
format=$char10. informat=$char10.;
Input
@1 colore $char49.
@50 orig $char200.
@250 app $char10.
@260 data_v $char10.
;
run;
If I don't use encoding="wlatin1", I get wrong characters in the SAS dataset:
If I use encoding="wlatin1", I get the correct characters, but the variables that follow are shifted:
The session encoding is ENCODING=UTF-8.
Read it with ENCODING=ANY and then transcode the strings yourself.
Make sure to define the variables long enough to hold the UTF-8 version of the text, that is, longer than the number of bytes read from the file.
data test;
infile inp encoding="any" truncover;
length colore $60 orig $250 app $15 data_v $15 ;
input colore $char49. orig $char200. app $char10. data_v $char10. ;
array _c _character_;
do over _c;
_c=kcvt(_c,'wlatin1','utf-8');
end;
run;
Or read the file using WLATIN1 encoding, but pull the strings from the _INFILE_ variable using KSUBSTR() instead of the INPUT statement.
data test;
infile inp encoding="wlatin1" truncover;
length colore $60 orig $250 app $15 data_v $15 ;
input ;
colore =ksubstr(_infile_,1,49);
orig =ksubstr(_infile_,50,200);
app =ksubstr(_infile_,250,10);
data_v = ksubstr(_infile_,260,10);
run;
The reason ENCODING="WLATIN1" causes trouble when you read the file in a SAS session that uses UTF-8 encoding is that each line is transcoded as it is read. The non-ASCII characters grow from a single byte to multiple bytes, so the APP and DATA_V fields no longer sit at the byte positions that the column pointers in the INPUT statement expect.
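The byte growth behind that shift can be checked on a single character. This is only a minimal sketch, assuming a UTF-8 session: the hex literal 'B1'x stands for the ± character as it is stored in a WLATIN1 file, and the variable names raw, utf8, bytes and chars are made up for the illustration.

```sas
data _null_;
  length raw $1 utf8 $4;
  raw   = 'B1'x;                          /* the single WLATIN1 byte for the plus-minus sign */
  utf8  = kcvt(raw, 'wlatin1', 'utf-8');  /* transcode that byte to UTF-8 */
  bytes = length(utf8);                   /* length counted in bytes */
  chars = klength(utf8);                  /* length counted in characters */
  put utf8= $hex4. bytes= chars=;         /* write the hex bytes and both counts to the log */
run;
```

In UTF-8 the ± character is encoded as the two bytes C2 B1, so every such character that appears earlier on a line pushes the later fields one byte to the right.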
If you were not reading by column position, but instead had a delimited file, like a CSV file, you would not have had this trouble.
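As a rough sketch of that delimited case, the step below assumes a comma-separated version of the same data. The fileref inpcsv, the path /my/dir/file/prova.csv and the dataset name test_csv are hypothetical and do not come from the question.

```sas
/* Hypothetical comma-delimited copy of the file; the path is an assumption */
filename inpcsv "/my/dir/file/prova.csv";

data test_csv;
  /* Transcode from WLATIN1 while reading; fields are found by delimiter, not by column */
  infile inpcsv encoding="wlatin1" dsd dlm=',' truncover;
  /* Lengths leave room for multi-byte UTF-8 characters after transcoding */
  length colore $60 orig $250 app $15 data_v $15;
  /* List input: each variable takes the next delimited field */
  input colore orig app data_v;
run;
```

Because list input splits the line on the delimiter after transcoding, extra bytes in one field do not change where the next field starts.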