Pandas read_csv: Data is not being read from text file (open() reads hex chars)


I'm trying to read a text file with pandas.read_csv, but the data is not being loaded (I only get a dataframe full of NA values). The text file contains valid data (I can open it with Excel). When I read it with pathlib.Path.open(), it shows lines with hex codes.

Let me show you what is happening:

import pandas as pd
from pathlib import Path

path = Path('path/to/my/file.txt')
# This raises a UnicodeDecodeError, as usual with Windows files:
df = pd.read_csv(path, dtype=str) 
## UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 96: invalid continuation byte

# This imports a dataframe full of null values:
df = pd.read_csv(path, dtype=str, encoding='latin1') 
print(df)
##           C Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6  \
## 0     <NA>       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>   
## 1     <NA>       <NA>       <NA>       <NA>       <NA>       <NA>       <NA>  
## ...

# So, what is Python reading? I tried this:
with path.open('r') as f:
    data = f.readline()
print(data)
## 'C\x00e\x00n\x00t\x00r\x00o\x00 \x00B\x00e\x00n\x00e\x00f\x00i\x00c\x00i\x00o\x00s\x00\n

And, as I said before, when I open the file with Excel, it shows exactly how it is supposed to look: a text file with values separated by pipes (|). So, right now, I'm feeling quite surprised.

What am I missing? Can anyone point me in the right direction? Which is the right encoding?


Solution

  • The \x00 after every character in your readline() output suggests that the text file is encoded as neither utf-8 nor latin1, but with two bytes per character. Try 'UTF-16 Little Endian' by changing the read_csv line to:

    df = pd.read_csv(path, dtype=str, encoding='utf-16le')
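
If you would rather confirm the encoding than guess it, you can look at the raw bytes. The following is a minimal sketch using only the standard library; `sniff_utf16` is a hypothetical helper name, and the NUL-counting heuristic is an assumption that only works for mostly ASCII/Latin text and does not try to tell UTF-16 apart from UTF-32.

```python
import codecs
from typing import Optional


def sniff_utf16(raw: bytes) -> Optional[str]:
    """Heuristic (hypothetical helper): return a likely UTF-16 codec name
    for `raw`, or None if the bytes do not look like UTF-16."""
    # A byte order mark settles it: the generic 'utf-16' codec reads the
    # BOM and picks the byte order itself.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    sample = raw[:200]
    half = len(sample) // 2
    if half < 2:
        return None

    # For mostly ASCII/Latin text, UTF-16 stores one NUL byte per character:
    # in the odd positions for little endian, in the even ones for big endian.
    even_nuls = sample[0::2].count(0)
    odd_nuls = sample[1::2].count(0)
    if odd_nuls > half * 0.5 and odd_nuls > even_nuls:
        return "utf-16-le"
    if even_nuls > half * 0.5 and even_nuls > odd_nuls:
        return "utf-16-be"
    return None


# Example with made-up data shaped like the first line in the question.
raw = "Centro Beneficios|Año\n".encode("utf-16-le")
print(sniff_utf16(raw))
```

With a real file you would pass something like `path.read_bytes()[:200]` and then hand the result to `encoding=` in `pd.read_csv`. Since the values are separated by pipes, you will most likely also need `sep='|'`; otherwise each line ends up in a single column.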