Read space separated text file in pandas

I am trying to read a text file present in this url into a pandas dataframe. https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/recent/TU_Stundenwerte_Beschreibung_Stationen.txt

It has uneven spacing between columns. I have tried sep='\s+', delim_whitespace=True but none of these are working. Please suggest a way to read this text file into pandas dataframe.

Solution

The read_fwf function in pandas can read a file with a table of fixed-width formatted lines into a DataFrame.

The header line confuses the auto-width calculations so best to skip the header lines and explicitly add the column names so in this case the argument skiprows=2 is added.

import pandas as pd

url ='https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/recent/TU_Stundenwerte_Beschreibung_Stationen.txt'

df = pd.read_fwf(url, encoding="ansi", skiprows=2,
                 names=['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',
                        'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland'])
print(df)

Output:

     Stations_id  von_datum  bis_datum  Stationshoehe  geoBreite  geoLaenge           Stationsname           Bundesland
0              3   19500401   20110331            202    50.7827     6.0941                 Aachen  Nordrhein-Westfalen
1             44   20070401   20220920             44    52.9336     8.2370           Großenkneten        Niedersachsen
2             52   19760101   19880101             46    53.6623    10.1990   Ahrensburg-Wulfsdorf   Schleswig-Holstein
3             71   20091201   20191231            759    48.2156     8.9784        Albstadt-Badkap    Baden-Württemberg
4             73   20070401   20220920            340    48.6159    13.0506   Aldersbach-Kriestorf               Bayern
..           ...        ...        ...            ...        ...        ...                    ...                  ...
663        19171   20200901   20220920             13    54.0038     9.8553     Hasenkrug-Hardebek   Schleswig-Holstein
664        19172   20200901   20220920             48    54.0246     9.3880                 Wacken   Schleswig-Holstein

[665 rows x 8 columns]

If want to load the file locally and open it then just change the url to the local file name.

df = pd.read_fwf('TU_Stundenwerte_Beschreibung_Stationen.txt', encoding="ansi", skiprows=2,
                 names=['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',
                 'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland'])