Getting rid of BOM between SAS and R


I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:

read.table(myfile, header = TRUE, sep = "\t")

To my surprise, the data was messed up, but only in a sneaky way. Numeric values changed seemingly at random while the overall layout looked normal, so it took me a while to notice the problem, which I now assume is caused by the BOM.

This is not a new issue of course; they address it briefly here, and recommend using

read.table(myfile, fileEncoding = "UTF-8", header = TRUE, sep = "\t")

However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:

read.table(myfile, fileEncoding = "UTF-8", header = FALSE, sep = "\t")
read.table(myfile, header = FALSE, sep = "\t")

In either case, I have to do some funny business to replace the column names with the first row, but only after I remove whatever form of the BOM appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't).

Isn't there a simple way to just remove the BOM and use read.table without any special arguments?

Update for @Joe: The SAS that I used:

FILENAME myfile 'C:\Documents ... file.txt'  encoding="utf-8";
proc export data=lib.sastable
  outfile=myfile
  dbms=tab  replace;
  putnames=yes;
run;

Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but weirdly the last few digits of the numbers in the first column get messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?

Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
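
One possible explanation for why colClasses = "character" helps, which the post itself does not confirm: if the first column holds very long numbers (IDs with more than about 15 significant digits, say), read.table converts them to doubles and the trailing digits cannot all be represented. The sketch below uses made-up data and a temporary file to show that effect; the column names and values are assumptions for illustration only.

```r
# Build a small BOM-prefixed, tab-delimited file with two long IDs
# that differ only in the last digit (illustrative data, not from the post).
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # UTF-8 BOM
writeBin(charToRaw("id\tvalue\n12345678901234567890\t1\n12345678901234567891\t2\n"), con)
close(con)

# Default type conversion: the IDs become doubles and lose their last digits.
as_num <- read.table(tmp, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t")

# Reading everything as character keeps the digits exactly as written.
as_chr <- read.table(tmp, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t",
                     colClasses = "character")

as_num$id[1] == as_num$id[2]  # the two distinct IDs collapse to the same double
as_chr$id                     # both original strings are preserved
```

If this is the cause, it may be enough to read only the affected column as character and convert the remaining columns afterwards.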


Solution

  • As per your link, it looks like it works for me with:

    read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')
    

    Note the -BOM in the file encoding.

    This is in 2.1 Variations on read.table in the R documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
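
To check whether a file really starts with a BOM before blaming it, something like the following sketch may help. has_utf8_bom is a hypothetical helper (not part of R), and the test file is made-up data written to a temporary path.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Build a small BOM-prefixed, tab-delimited test file (illustrative data).
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id\tvalue\n1\t10\n2\t20\n"), con)
close(con)

has_utf8_bom(tmp)

# With "UTF-8-BOM", the BOM is dropped and the first column name comes out clean.
dat <- read.table(tmp, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t")
names(dat)
```

If has_utf8_bom returns FALSE for the real file, the garbled values probably have a different cause than the BOM.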