I am hoping someone out there can help. I am trying to figure out how to solve gap and island problems with pandas, but have been unsuccessful. For a sample data set I am using:
np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
df = pd.DataFrame({'date': dates,
'group': 'A',
'value': np.random.randint(0, 100, size=len(dates))
}).append(pd.DataFrame({'date': dates,
'group': 'B',
'value': np.random.randint(0, 100, size=len(dates))
})).append(pd.DataFrame({'date': dates,
'group': 'C',
'value': np.random.randint(0, 100, size=len(dates))
})).reset_index(drop=True)
length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df
date group value
0 2010-01-01 A 57
1 2010-01-02 A 11
2 2010-01-03 A 83
3 2010-01-04 A 83
4 2010-01-05 A 93
... ... ... ...
992 2010-12-27 C 50
993 2010-12-28 C 59
994 2010-12-29 C 85
995 2010-12-30 C 32
996 2010-12-31 C 3
I would like to identify islands with breaks in dates of > 1 day to start start a new island.
Any help is appreciated!
Not sure where to start with this one.
My assumption is that you want to identify the islands by group
. I see two versions to do that: (1) You number the islands groupwise, i.e. an island number can occur several times. (2) You number overall. Therefore 2 proposals:
groupwise:
df = (
df.sort_values(["group", "date"])
.assign(island=lambda df:
df.groupby("group")["date"].diff().dt.days.fillna(2).gt(1)
.groupby(df["group"]).cumsum())
)
overall:
df = (
df.sort_values(["group", "date"])
.assign(island=lambda df:
df.groupby("group")["date"].diff().dt.days.fillna(2).gt(1).cumsum())
)
After sorting along group
and date
(not necessary for the example) group by group
and take the diff
s over date
: this results in timedelta
s. Then extract the days, fill the NaN
s at the beginning of the groups with 2
, and check where the number of days is greater than 1: this results in booleans. Now use .cumsum
, according to the counting preferences, to indentify the islands.
Results for the following sample dataframe
rng = np.random.default_rng(12345678)
dates = pd.date_range(start="2010-01-01", end="2010-12-31", freq="D")
dfs = (pd.DataFrame({"date": dates, "group": group,
"value": rng.integers(0, 100, len(dates))})
for group in "ABC")
df = pd.concat(dfs, ignore_index=True)
df = df.drop(rng.integers(0, df.shape[0], 300)).reset_index(drop=True)
looks like
date group value island
0 2010-01-01 A 96 1
1 2010-01-02 A 95 1
2 2010-01-04 A 87 2
3 2010-01-05 A 28 2
4 2010-01-06 A 35 2
.. ... ... ... ...
825 2010-12-26 C 51 65
826 2010-12-28 C 73 66
827 2010-12-29 C 50 66
828 2010-12-30 C 48 66
829 2010-12-31 C 28 66
[830 rows x 4 columns]
or
date group value island
0 2010-01-01 A 96 1
1 2010-01-02 A 95 1
2 2010-01-04 A 87 2
3 2010-01-05 A 28 2
4 2010-01-06 A 35 2
.. ... ... ... ...
825 2010-12-26 C 51 205
826 2010-12-28 C 73 206
827 2010-12-29 C 50 206
828 2010-12-30 C 48 206
829 2010-12-31 C 28 206
[830 rows x 4 columns]