I have been trying all day to read a CSV file (stored in my OneDrive), with no success.
My best attempt so far gave me a DataFrame with one column and the correct number of rows, but it should have around 20 columns. I found many interesting posts here. The best one was https://stackoverflow.com/a/75445718, which I implemented.
I changed the return line as follows:
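# Sample the first `size` fraction of the file and let chardet guess the encoding.
byte_size = int(os.path.getsize(path) * size)
with open(path, "rb") as rawdata:
    result = chardet.detect(rawdata.read(byte_size))
try:
    return pd.read_csv(path, encoding=result["encoding"], header=1, sep='\t')
except UnicodeError:
    # Retry with a larger sample if the guessed encoding fails to decode.
    return self.read_csv(path=path, size=size + 0.20)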
I have no idea why it "works" (a bit) with \t... if I open the CSV in Notepad, the separator is ;. Notepad++ tells me the file is "UTF-16 Little Endian", but passing that as the encoding doesn't work (pandas doesn't recognize this encoding name).
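For what it's worth, pandas hands the `encoding` value to Python's codec machinery, so it has to be a Python codec name such as "utf-16" rather than the label Notepad++ displays. One way to check what the file really is would be to look for a byte-order mark. The following is only a sketch: `guess_encoding_from_bom` is a hypothetical helper (not part of pandas or chardet), and the "latin-1" fallback is an assumption for files without a BOM.

```python
import codecs


def guess_encoding_from_bom(path, default="latin-1"):
    """Return a Python codec name based on the file's byte-order mark, if any.

    Hypothetical helper for illustration; `default` is an assumed fallback.
    """
    with open(path, "rb") as fh:
        head = fh.read(4)
    # Test UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM itself to pick the byte order.
        return "utf-16"
    return default
```

The returned name could then be passed on, for example `pd.read_csv(path, sep=";", encoding=guess_encoding_from_bom(path))`.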
As a last resort, I would read the file line by line and build the DataFrame myself... but I would rather avoid that.
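Before parsing everything by hand, reading just the first line may already show which separator the file uses. A rough sketch, assuming the first line is representative of the rest of the file; `guess_separator` and its candidate list are made up for illustration, and the encoding has to be supplied by the caller.

```python
def guess_separator(path, encoding, candidates=(";", ",", "\t")):
    """Pick the candidate separator that occurs most often in the first line.

    Hypothetical helper; assumes the first line looks like the data rows.
    """
    with open(path, "r", encoding=encoding, newline="") as fh:
        first_line = fh.readline()
    # Count each candidate and keep the most frequent one.
    counts = {sep: first_line.count(sep) for sep in candidates}
    return max(counts, key=counts.get)
```

A single-column DataFrame usually means the separator passed to `pd.read_csv` never occurs in the decoded lines, so the result of this check can be used as the `sep` argument.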
Why not simply specify the encoding?
The header contains some Latin-1 characters like the degree sign (°):
import pandas as pd

fp = r"Weathercloud fambon05 2023-12.csv"
df = pd.read_csv(fp, sep=";", encoding="latin-1")  # or "ansi" (alias of "mbcs", Windows only)
Output:
Datum (Europe/Berlin)  Temperatur Innen (°C)  ...  Solarstrahlung (W/m²)  UV Index
2023-12-01 00:00       21,4                   ...  0                      0.00
2023-12-01 00:10       21,4                   ...  0                      0.00
2023-12-01 00:20       21,3                   ...  0                      0.00
2023-12-01 00:30       21,4                   ...  0                      0.00
2023-12-01 00:40       21,4                   ...  0                      0.00
...                    ...                    ...  ...                    ...
2023-12-31 23:10       19,3                   ...  NaN                    NaN
2023-12-31 23:20       19,4                   ...  NaN                    NaN
2023-12-31 23:30       19,3                   ...  NaN                    NaN
2023-12-31 23:40       19,2                   ...  NaN                    NaN
2023-12-31 23:50       19,4                   ...  NaN                    NaN
[4464 rows x 19 columns]
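The temperature values above use a comma as the decimal mark (21,4), so those columns are most likely loaded as strings rather than floats. If that is the case for the real file, `decimal=","` in `read_csv` should convert them. A minimal sketch on a made-up two-row sample that only mimics the layout of the output (it is not the real file):

```python
import io

import pandas as pd

# Made-up sample mirroring two of the columns shown above, not the real data.
sample = (
    "Datum (Europe/Berlin);Temperatur Innen (°C)\n"
    "2023-12-01 00:00;21,4\n"
    "2023-12-01 00:20;21,3\n"
)

# decimal="," tells the parser to treat "21,4" as the float 21.4.
df = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")
print(df.dtypes)
```

For the real file, the same argument would be added to the call above, next to `sep=";"` and `encoding="latin-1"`.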