I have a data frame and a snippet of it is given below.
data = {'ID':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
'Date':['03/25/2021', '03/25/2021','03/27/2021', '03/29/2021', '03/10/2021','03/11/2021','03/15/2021','03/16/2021', '03/21/2021','03/25/2021']}
df = pd.DataFrame(data)
I am looking for a final result which should look like this.
The explanation: For each ID, the study_date starts from the starting date and ends on the last date. The missing dates in the middle have to be filled. If the date was missing from the original dataframe, then 'missing_date' column will have value 1 or else 0. And the study day column is the number of days from the starting to ending days incrementing in order.
If there are multiple rows with same dates, those have to be retained along with the new columns having the same previous data as shown.
I tried some stuff but I have been stuck on this for a while now. Any help is greatly appreciated.
I came up with the code given below. This gets the 'Missing_Date' and 'Studyday'. However, if there are multiple entries with same Dates, then it is not being displayed.
def fn(x):
dr = pd.date_range(x["Date"].min(), x["Date"].max())
out = pd.DataFrame({"Date": dr}, index=range(1, len(dr) + 1))
out["Missing_Date"] = (~out["Date"].isin(x["Date"])).astype(int)
return out
# if the "Date" column is not converted:
df["Date"] = pd.to_datetime(df["Date"])
x = (
df.groupby("ID")
.apply(fn)
.reset_index()
.rename(columns={"level_1": "StudyDay"})
)
print(x)
Thanks.
One way is to groupby
, find the set difference with the missing dates by ID, construct a new df, concat
with the original and finally rank
for StudyDate
:
df["missing"] = 0
df["Date"] = pd.to_datetime(df["Date"])
new = pd.DataFrame([(k, d, 1) for k, v in df.groupby("ID")["Date"]
for d in pd.date_range(min(v), max(v))^v],
columns=df.columns)
df = pd.concat([df, new], ignore_index=True).sort_values(["ID", "Date"]).reset_index(drop=True)
print (df.assign(Studydate=df.groupby('ID')['Date'].rank(method='dense').astype(int)))
ID Date missing Studydate
0 A 2021-03-25 0 1
1 A 2021-03-25 0 1
2 A 2021-03-26 1 2
3 A 2021-03-27 0 3
4 A 2021-03-28 1 4
5 A 2021-03-29 0 5
6 B 2021-03-10 0 1
7 B 2021-03-11 0 2
8 B 2021-03-12 1 3
9 B 2021-03-13 1 4
10 B 2021-03-14 1 5
11 B 2021-03-15 0 6
12 B 2021-03-16 0 7
13 C 2021-03-21 0 1
14 C 2021-03-22 1 2
15 C 2021-03-23 1 3
16 C 2021-03-24 1 4
17 C 2021-03-25 0 5