Search code examples
pythonpandasdataframewhitespacedelimiter

Creating Pandas Dataframe from textfile of long term Climate Data


I have a Textfile (.DAT) file, in which there is daily climate data of a station,

This is the URL of Dataset

daily_data_file=r"..\25_may24_SD.DAT"

df = pd.read_csv(daily_data_file, skiprows=[5], delimiter=r"\s+", names=['YEAR', 'DATE', 'JAN', 'FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'])

It creates the data frame but

as in case, some month has 31 days, some have 30 and February has either 28 or 29

but as the whitespace being omitted/delimited

the last 3 columns at end of each month got shifted to left of dataframe as in here in the output leaving NaN values at the end.

In [4]: df
Out [4]: 

         YEAR   DATE    JAN   FEB   MAR  APR    MAY  JUN    JUL     AUG  SEP    OCT  NOV    DEC
0        YEAR   DATE    JAN  FEB    MAR  APR    MAY  JUN    JUL     AUG  SEP    OCT  NOV    DEC
1        1901   1       0.0  0.0    0.3  0.0    3.7  0.9    11.1    0.1  2.5    0.0  0.0    0.0
2        1901   2       0.0  0.0    16.5 0.0    12.3 0.0    11.4    2.7  4.9    0.0  0.0    0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3803     2019   27      0.0  0.0    0.0  0.0    0.0  4.4    12.9    1.1  10.2   6.8  0.0    0.0
3804     2019   28      0.0  0.0    0.0  0.1    0.0  6.0    7.3     0.1  0.3    9.8  0.0    0.0
3805     2019   29      0.0  0.0    0.0  0.0    7.5  7.5    0.6     0.8  8.3    0.0  0.0    NaN
3806    2019    30      0.0  0.0    0.0  0.0    10.2 10.0   3.9     2.0  2.3    0.0  0.0    NaN
3807    2019    31      0.0  0.0    0.0  15.7   24.0 4.5    1.2     NaN  NaN    NaN  NaN    NaN

How the text file should be delimited so that data stays in original form i.e NaN value on 29th, 30th and 31st of the each month in the respective column,

instead of them shifting to the left of dataframe.

The format of data in text file is like this.

TextFile


Solution

    • This type of data is better handled by read_fwf()
    • to make infer work as wanted, have given it 32 lines of fixed format data
    • once all data is in dataframe, cleanup by testing YEAR is numeric to exclude the blank lines and header lines at multiple points in data
    • finally set expected data types on all columns
    import requests
    import pandas as pd
    import numpy as np
    import io
    from pathlib import Path
    
    # download sample data and save to file...
    url = "https://raw.githubusercontent.com/abhilashsinghimd/AASD_Geojson/main/25_may24_SD1.DAT"
    res = requests.get(url)
    with open(Path.cwd().joinpath("SO_example.DAT"), "w") as f: f.write(res.text)
        
    # read file from your file system here...
    with open(Path.cwd().joinpath("SO_example.DAT"), "r") as f: text = f.read()
        
    df = pd.read_fwf(
        io.StringIO(
            "\n".join(text.split("\n")[6:7] + text.split("\n")[8 : 8 + 31])
            + "\n".join(text.split("\n")[8+31:])
        ),
        infer_nrows=32,
    )
    
    print(f"expected row count:{(2019-1900)*31}")
    # exclude header rows littered through data
    df = df.loc[~pd.to_numeric(df["YEAR"], errors="coerce").isna()]
    # convert to expected datatypes
    df = df.assign(**{c:df.loc[:,c].astype("int" if c in ["YEAR","DATE"] else "float") for c in df.columns})
    
    pd.set_option("display.width",100)
    print(df)
    
    

    output

    expected row count:3689
          YEAR  DATE  JAN  FEB   MAR  APR   MAY   JUN   JUL   AUG   SEP  OCT  NOV  DEC
    0     1901     1  0.0  0.0   0.3  0.0   3.7   0.9  11.1   0.1   2.5  0.0  0.0  0.0
    1     1901     2  0.0  0.0  16.5  0.0  12.3   0.0  11.4   2.7   4.9  0.0  0.0  0.0
    2     1901     3  0.0  0.0   0.0  0.0   1.2   0.0   1.3   1.9   0.6  0.0  0.0  0.0
    3     1901     4  0.0  0.0   0.0  0.0   1.2   0.0   7.6  20.5   2.5  0.0  0.0  0.0
    4     1901     5  0.0  0.0   0.0  1.9   0.0   0.0  18.7  41.4   2.6  0.0  0.0  0.0
    ...    ...   ...  ...  ...   ...  ...   ...   ...   ...   ...   ...  ...  ...  ...
    4156  2019    27  0.0  0.0   0.0  0.0   0.0   4.4  12.9   1.1  10.2  6.8  0.0  0.0
    4157  2019    28  0.0  0.0   0.0  0.1   0.0   6.0   7.3   0.1   0.3  9.8  0.0  0.0
    4158  2019    29  0.0  NaN   0.0  0.0   0.0   7.5   7.5   0.6   0.8  8.3  0.0  0.0
    4159  2019    30  0.0  NaN   0.0  0.0   0.0  10.2  10.0   3.9   2.0  2.3  0.0  0.0
    4160  2019    31  0.0  NaN   0.0  NaN   0.0   NaN  15.7  24.0   NaN  4.5  NaN  1.2
    
    [3689 rows x 14 columns]