I'm trying to pull out a list of unique values from a data frame but I keep getting a value that I can't find anywhere in the original data frame. Has anyone run into something like this before?
I read in a text file:

```python
import pandas as pd

tmpPandaObj = pd.read_csv(fn, sep='\t', header=None)
tmpPandaObj.columns = ['stockId', 'dt', 'hhmm', 'seq', 'ecalls']
```
Then I pull out the unique values:

```python
uniqueStockIdVec = tmpPandaObj.stockId.unique()
```
Yet I keep getting `'\ufeff19049'` included in the unique vector. I've searched the text files and the data frame as hard as I can, with no luck finding any `'\ufeff19049'` value. The only unique values should be `'19049'`, `'24937'`, and `'139677'`.
First, the fix: specify `encoding='utf-8-sig'` when reading the file.
Now, the explanation:
`\ufeff` is the Unicode BOM (byte order mark) character. Whenever one tool writes a file with a BOM, and another tool reads the file using an explicit encoding like `UTF-16-LE` instead of a BOM-switching version like `UTF-16`, the BOM is treated as a normal character, so `\ufeff` shows up in your string. Outside of Microsoft-land, this specific mismatch (reading UTF-16 as UTF-16-LE) is by far the most common version of the problem.
But if one of the tools is from Microsoft, the culprit is more commonly UTF-8. The Unicode standard recommends never using a BOM with UTF-8 (single bytes have no byte order to mark), but doesn't quite forbid it, so many Microsoft tools keep writing one. Every other tool, including Python (and pandas), then reads the file as UTF-8 without a BOM, and an extra `\ufeff` shows up. (Older, non-Unicode-aware tools will read the same three bytes `\xef\xbb\xbf` as something like `ï»¿`, which you may have seen a few times.)
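You can reproduce that mojibake directly by decoding the BOM's bytes with a legacy 8-bit codec (cp1252 here, the usual Windows default):

```python
# The UTF-8 BOM is three bytes; an 8-bit codec maps each byte to its own
# character, producing the familiar three-character garbage.
bom = b"\xef\xbb\xbf"
print(bom.decode("cp1252"))  # ï»¿
```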
But while Python (and pandas) defaults to UTF-8, it does let you specify an encoding manually, and one of the encodings it ships with is called `utf-8-sig`, which means "UTF-8 with a useless BOM at the start" — the BOM is stripped on reading if present.
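At the codec level the difference is easy to see, no pandas needed:

```python
raw = b"\xef\xbb\xbf19049"

# Plain UTF-8 keeps the BOM as a leading character...
print(repr(raw.decode("utf-8")))           # '\ufeff19049'

# ...while utf-8-sig strips one leading BOM, and is a no-op if none is there.
print(repr(raw.decode("utf-8-sig")))       # '19049'
print(repr(b"19049".decode("utf-8-sig")))  # '19049'
```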