Search code examples
pythonpandasdataframedelimiter

Read space separated text file in pandas


I am trying to read a text file present in this url into a pandas dataframe. https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/recent/TU_Stundenwerte_Beschreibung_Stationen.txt

It has uneven spacing between columns. I have tried sep='\s+', delim_whitespace=True but none of these are working. Please suggest a way to read this text file into pandas dataframe.


Solution

  • The read_fwf function in pandas can read a file with a table of fixed-width formatted lines into a DataFrame.

    The header line confuses the auto-width calculations so best to skip the header lines and explicitly add the column names so in this case the argument skiprows=2 is added.

    import pandas as pd
    
    url ='https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/recent/TU_Stundenwerte_Beschreibung_Stationen.txt'
    
    df = pd.read_fwf(url, encoding="ansi", skiprows=2,
                     names=['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',
                            'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland'])
    print(df)
    

    Output:

         Stations_id  von_datum  bis_datum  Stationshoehe  geoBreite  geoLaenge           Stationsname           Bundesland
    0              3   19500401   20110331            202    50.7827     6.0941                 Aachen  Nordrhein-Westfalen
    1             44   20070401   20220920             44    52.9336     8.2370           Großenkneten        Niedersachsen
    2             52   19760101   19880101             46    53.6623    10.1990   Ahrensburg-Wulfsdorf   Schleswig-Holstein
    3             71   20091201   20191231            759    48.2156     8.9784        Albstadt-Badkap    Baden-Württemberg
    4             73   20070401   20220920            340    48.6159    13.0506   Aldersbach-Kriestorf               Bayern
    ..           ...        ...        ...            ...        ...        ...                    ...                  ...
    663        19171   20200901   20220920             13    54.0038     9.8553     Hasenkrug-Hardebek   Schleswig-Holstein
    664        19172   20200901   20220920             48    54.0246     9.3880                 Wacken   Schleswig-Holstein
    
    [665 rows x 8 columns]
    

    If want to load the file locally and open it then just change the url to the local file name.

    df = pd.read_fwf('TU_Stundenwerte_Beschreibung_Stationen.txt', encoding="ansi", skiprows=2,
                     names=['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',
                     'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland'])