Tags: python, pandas, csv, unicode

Read a CSV without knowing its encoding


I spent all day trying to read this CSV (stored in my OneDrive), with no success.

https://1drv.ms/u/s!ArV4HGOvFQJuhap7lf6ETfpcnLl6Og?e=7SEeL7

Best attempt so far: I got a DataFrame with one column and the correct number of rows, but it should have around 20 columns. I found many interesting posts here. The best one was https://stackoverflow.com/a/75445718, which I implemented.

I changed the return line as follows:

    try:
        byte_size = int(os.path.getsize(path) * size)

        with open(path, "rb") as rawdata:
            result = chardet.detect(rawdata.read(byte_size))

        return pd.read_csv(path, encoding=result["encoding"], header=1, sep='\t')

    except UnicodeError:
        return self.read_csv(path=path, size=size + 0.20)

I have no idea why it "works" (a bit) with \t. If I open the CSV in Notepad, the separator is ;. Notepad++ tells me the file is "UTF-16 Little Endian", but that doesn't work (pandas doesn't know this encoding).

As a last resort, I would read it line by line and build my own DataFrame.


Solution

  • Why not simply specify the encoding?

    The header contains some Latin-1 characters, such as the degree sign (°):

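    import pandas as pd

    fp = r"Weathercloud fambon05 2023-12.csv"

    # The file is semicolon-separated; "ansi" is a Windows-only codec alias.
    df = pd.read_csv(fp, sep=";", encoding="latin-1")  # or encoding="ansi"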
    

    Output:

    Datum (Europe/Berlin) Temperatur Innen (°C)  ... Solarstrahlung (W/m²) UV Index
         2023-12-01 00:00                  21,4  ...                     0     0.00
         2023-12-01 00:10                  21,4  ...                     0     0.00
         2023-12-01 00:20                  21,3  ...                     0     0.00
         2023-12-01 00:30                  21,4  ...                     0     0.00
         2023-12-01 00:40                  21,4  ...                     0     0.00
                      ...                   ...  ...                   ...      ...
         2023-12-31 23:10                  19,3  ...                   NaN      NaN
         2023-12-31 23:20                  19,4  ...                   NaN      NaN
         2023-12-31 23:30                  19,3  ...                   NaN      NaN
         2023-12-31 23:40                  19,2  ...                   NaN      NaN
         2023-12-31 23:50                  19,4  ...                   NaN      NaN
    
    [4464 rows x 19 columns]
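
The output above shows values such as 21,4 in the temperature column, which suggests that column was read as text rather than as numbers. Here is a minimal sketch of one way to convert it after loading, assuming the comma is a decimal separator. The inline sample data is made up to imitate the layout shown above and is not the real file:

```python
import io

import pandas as pd

# Hypothetical sample imitating the layout shown above; not the real file.
sample = (
    "Datum (Europe/Berlin);Temperatur Innen (°C);UV Index\n"
    "2023-12-01 00:00;21,4;0.00\n"
    "2023-12-01 00:10;21,3;1.25\n"
).encode("latin-1")

# Same call as in the answer, reading from memory instead of a path.
df = pd.read_csv(io.BytesIO(sample), sep=";", encoding="latin-1")

# Replace the decimal comma with a dot, then convert the text to floats.
temp_col = "Temperatur Innen (°C)"
df[temp_col] = pd.to_numeric(df[temp_col].str.replace(",", ".", regex=False))

# Parse the timestamp column into real datetimes as well.
date_col = "Datum (Europe/Berlin)"
df[date_col] = pd.to_datetime(df[date_col], format="%Y-%m-%d %H:%M")

print(df.dtypes)
```

Passing decimal="," to pd.read_csv is another option, but the UV Index column in the output above already seems to use a dot, so converting column by column may be the safer choice for this file.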