I have a Textfile (.DAT) file, in which there is daily climate data of a station,
daily_data_file=r"..\25_may24_SD.DAT"
df = pd.read_csv(daily_data_file, skiprows=[5], delimiter=r"\s+", names=['YEAR', 'DATE', 'JAN', 'FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'])
It creates the data frame but
as in case, some month has 31 days, some have 30 and February has either 28 or 29
but as the whitespace being omitted/delimited
the last 3 columns at end of each month got shifted to left of dataframe as in here in the output leaving NaN values at the end.
In [4]: df
Out [4]:
YEAR DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
0 YEAR DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
1 1901 1 0.0 0.0 0.3 0.0 3.7 0.9 11.1 0.1 2.5 0.0 0.0 0.0
2 1901 2 0.0 0.0 16.5 0.0 12.3 0.0 11.4 2.7 4.9 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3803 2019 27 0.0 0.0 0.0 0.0 0.0 4.4 12.9 1.1 10.2 6.8 0.0 0.0
3804 2019 28 0.0 0.0 0.0 0.1 0.0 6.0 7.3 0.1 0.3 9.8 0.0 0.0
3805 2019 29 0.0 0.0 0.0 0.0 7.5 7.5 0.6 0.8 8.3 0.0 0.0 NaN
3806 2019 30 0.0 0.0 0.0 0.0 10.2 10.0 3.9 2.0 2.3 0.0 0.0 NaN
3807 2019 31 0.0 0.0 0.0 15.7 24.0 4.5 1.2 NaN NaN NaN NaN NaN
How the text file should be delimited so that data stays in original form i.e NaN value on 29th, 30th and 31st of the each month in the respective column,
instead of them shifting to the left of dataframe.
The format of data in text file is like this.
import requests
import pandas as pd
import numpy as np
import io
from pathlib import Path
# download sample data and save to file...
url = "https://raw.githubusercontent.com/abhilashsinghimd/AASD_Geojson/main/25_may24_SD1.DAT"
res = requests.get(url)
with open(Path.cwd().joinpath("SO_example.DAT"), "w") as f: f.write(res.text)
# read file from your file system here...
with open(Path.cwd().joinpath("SO_example.DAT"), "r") as f: text = f.read()
df = pd.read_fwf(
io.StringIO(
"\n".join(text.split("\n")[6:7] + text.split("\n")[8 : 8 + 31])
+ "\n".join(text.split("\n")[8+31:])
),
infer_nrows=32,
)
print(f"expected row count:{(2019-1900)*31}")
# exclude header rows littered through data
df = df.loc[~pd.to_numeric(df["YEAR"], errors="coerce").isna()]
# convert to expected datatypes
df = df.assign(**{c:df.loc[:,c].astype("int" if c in ["YEAR","DATE"] else "float") for c in df.columns})
pd.set_option("display.width",100)
print(df)
expected row count:3689
YEAR DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
0 1901 1 0.0 0.0 0.3 0.0 3.7 0.9 11.1 0.1 2.5 0.0 0.0 0.0
1 1901 2 0.0 0.0 16.5 0.0 12.3 0.0 11.4 2.7 4.9 0.0 0.0 0.0
2 1901 3 0.0 0.0 0.0 0.0 1.2 0.0 1.3 1.9 0.6 0.0 0.0 0.0
3 1901 4 0.0 0.0 0.0 0.0 1.2 0.0 7.6 20.5 2.5 0.0 0.0 0.0
4 1901 5 0.0 0.0 0.0 1.9 0.0 0.0 18.7 41.4 2.6 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4156 2019 27 0.0 0.0 0.0 0.0 0.0 4.4 12.9 1.1 10.2 6.8 0.0 0.0
4157 2019 28 0.0 0.0 0.0 0.1 0.0 6.0 7.3 0.1 0.3 9.8 0.0 0.0
4158 2019 29 0.0 NaN 0.0 0.0 0.0 7.5 7.5 0.6 0.8 8.3 0.0 0.0
4159 2019 30 0.0 NaN 0.0 0.0 0.0 10.2 10.0 3.9 2.0 2.3 0.0 0.0
4160 2019 31 0.0 NaN 0.0 NaN 0.0 NaN 15.7 24.0 NaN 4.5 NaN 1.2
[3689 rows x 14 columns]