Search code examples
pythonpandasunique

Unknown value keeps getting included in unique vector


I'm trying to pull out a list of unique values from a data frame but I keep getting a value that I can't find anywhere in the original data frame. Has anyone run into something like this before?

I read in a text file:

tmpPandaObj = pd.read_csv(fn, sep='\t', header=None)   
tmpPandaObj.columns = ['stockId','dt','hhmm','seq','ecalls']

Pull out the unique values:

uniqueStockIdVec = tmpPandaObj.stockId.unique()

Yet I keep getting '\ufeff19049' included in the unique vector. I've searched the text files and data frame as hard as I possibly can, with no luck finding any '\ufeff19049' value. The only unique values should be '19049', '24937', '139677'.


Solution

  • First, the fix: Specify encoding='UTF-8-sig' when reading the file.

    Now, the explanation:

    \ufeff is the Unicode BOM (Byte Order Mark) character. Whenever one tool writes a file with a BOM, and another tool reads the file using an explicit encoding like UTF-16-LE instead of a BOM-switching version like UTF-16, the BOM is treated as a normal character, so \ufeff shows up in your string. Outside of Microsoft-land, this specific issue (reading UTF-16 as UTF-16-LE) is by far the most common version of this problem.

    But if one of the tools is from Microsoft, it's more commonly UTF-8. The Unicode standard recommends never using a BOM with UTF-8 (because bytes don't need a byte-order mark), but doesn't quite forbid it, so many Microsoft tools keep doing it. And then every other tool, including Python (and Pandas), just reads it as UTF-8 without a BOM, causing an extra \ufeff to show up. (Older, non-Unicode-friendly tools will read the same three bytes \xef\xbb\xbf as something like , which you may have seen a few times.)

    But while Python (and Pandas) defaults to UTF-8, it does let you specify an encoding manually, and one of the encodings it comes with is called UTF-8-sig, which means UTF-8 with a useless BOM at the start.