I have two major problems, and I can't see how to solve them in Python. Let me explain the context. On the one hand, I have a dataset containing a date point for each ID (1 ID = 1 patient), like this:
ID | Date point |
---|---|
0001 | 25/12/2022 09:00 |
0002 | 29/12/2022 16:00 |
0003 | 30/12/2022 18:00 |
... | .... |
And on the other hand, I have a folder with many text files containing the time series, like this:
0001.txt
0002.txt
0003.txt
...
The files all have the same structure: the ID (the same as in the dataset) is in the file name, and the contents are laid out like this (the first column contains the date and the second the value):
25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...
1/ I would like to truncate the text files and keep only the values within 48 hours of the dataset's Date point.
2/ For some statistical analysis, I want to take values like the mean or the maximum of these variables and add them to a dataframe like this:
ID | Mean | Maximum |
---|---|---|
0001 | ||
0002 | ||
0003 | ||
... | .... | ... |
I know this will be a trivial problem for you, but for me (a beginner in Python) it is a challenge!
Thank you everybody.
Manage time series with a dataframe containing date points and extract some statistical values.
You could do something along these lines using pandas (I've not been able to test this fully):
import pandas as pd
from pathlib import Path
# I'll create a limited version of your initial table
data = {
"ID": ["0001", "0002", "0003"],
"Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}
# put in a Pandas DataFrame
df = pd.DataFrame(data)
# convert the "Date point" column to datetime objects
# (the dates are day-first, so give the format explicitly)
df["Date point"] = pd.to_datetime(df["Date point"], format="%d/%m/%Y %H:%M")
# provide the path to the folder containing the files
folder = Path("/path_to_files")
newdata = {"ID": [], "Mean": [], "Maximum": []} # an empty dictionary that you'll fill with the required statistical info
# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct the file name
    if inputfile.exists():
        # read in the file
        subdata = pd.read_csv(
            inputfile,
            sep=r"\s+",  # columns are separated by whitespace
            header=None,  # there's no header row
            names=["date", "time", "value"],
        )
        # combine the date and time columns into datetime objects
        # (parse_dates=[[0, 1]] is deprecated in recent pandas versions)
        subdata["datetime"] = pd.to_datetime(
            subdata["date"] + " " + subdata["time"], format="%d/%m/%Y %H:%M"
        )
        # get the values up to 48 hours after the current date point
        td = pd.Timedelta(value=48, unit="hours")
        mask = (subdata["datetime"] > date) & (subdata["datetime"] <= date + td)
        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata.loc[mask, "value"].mean())
        newdata["Maximum"].append(subdata.loc[mask, "value"].max())
# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
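As a quick sanity check, here is a self-contained sketch of the same truncate-and-aggregate steps (the folder and file name are made up for the demo: it writes a tiny 0001.txt to a temporary directory, then reads it back and computes the statistics):

```python
import tempfile
from pathlib import Path

import pandas as pd

# build a tiny example file: one reading per minute, starting just after the date point
folder = Path(tempfile.mkdtemp())
lines = [f"25/12/2022 09:{m:02d} {155 + m}" for m in range(1, 5)]
(folder / "0001.txt").write_text("\n".join(lines))

date = pd.to_datetime("25/12/2022 09:00", format="%d/%m/%Y %H:%M")

# read the file and combine the date and time columns
subdata = pd.read_csv(
    folder / "0001.txt",
    sep=r"\s+",
    header=None,
    names=["date", "time", "value"],
)
subdata["datetime"] = pd.to_datetime(
    subdata["date"] + " " + subdata["time"], format="%d/%m/%Y %H:%M"
)

# keep only the values up to 48 hours after the date point
td = pd.Timedelta(value=48, unit="hours")
mask = (subdata["datetime"] > date) & (subdata["datetime"] <= date + td)

print(subdata.loc[mask, "value"].mean())  # 157.5
print(subdata.loc[mask, "value"].max())   # 159
```

With values 156–159 all inside the 48-hour window, the mean is 157.5 and the maximum is 159; in your real data some readings would fall outside the window and be dropped by the mask.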