How to use Pandas to solve Gap and Island problem with dates?

I am hoping someone out there can help. I am trying to figure out how to solve gap and island problems with pandas, but have been unsuccessful. For a sample data set I am using:

np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
df = pd.DataFrame({'date': dates,
                   'group': 'A',
                   'value': np.random.randint(0, 100, size=len(dates))
                  }).append(pd.DataFrame({'date': dates,
                                          'group': 'B',
                                          'value': np.random.randint(0, 100, size=len(dates))
                  })).append(pd.DataFrame({'date': dates,
                                           'group': 'C',
                                           'value': np.random.randint(0, 100, size=len(dates))
                                          })).reset_index(drop=True)
length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df

    date    group   value
0   2010-01-01  A   57
1   2010-01-02  A   11
2   2010-01-03  A   83
3   2010-01-04  A   83
4   2010-01-05  A   93
... ... ... ...
992 2010-12-27  C   50
993 2010-12-28  C   59
994 2010-12-29  C   85
995 2010-12-30  C   32
996 2010-12-31  C   3

I would like to identify islands with breaks in dates of > 1 day to start start a new island.

Any help is appreciated!

Not sure where to start with this one.

Solution

My assumption is that you want to identify the islands by group. I see two versions to do that: (1) You number the islands groupwise, i.e. an island number can occur several times. (2) You number overall. Therefore 2 proposals:

groupwise:

df = (
    df.sort_values(["group", "date"])
    .assign(island=lambda df:
        df.groupby("group")["date"].diff().dt.days.fillna(2).gt(1)
        .groupby(df["group"]).cumsum())
)

overall:

df = (
    df.sort_values(["group", "date"])
    .assign(island=lambda df:
        df.groupby("group")["date"].diff().dt.days.fillna(2).gt(1).cumsum())
)

After sorting along group and date (not necessary for the example) group by group and take the diffs over date: this results in timedeltas. Then extract the days, fill the NaNs at the beginning of the groups with 2, and check where the number of days is greater than 1: this results in booleans. Now use .cumsum, according to the counting preferences, to indentify the islands.

Results for the following sample dataframe

rng = np.random.default_rng(12345678)
dates = pd.date_range(start="2010-01-01", end="2010-12-31", freq="D")
dfs = (pd.DataFrame({"date": dates, "group": group,
                     "value": rng.integers(0, 100, len(dates))})
       for group in "ABC")
df = pd.concat(dfs, ignore_index=True)
df = df.drop(rng.integers(0, df.shape[0], 300)).reset_index(drop=True)

looks like

          date group  value  island
0   2010-01-01     A     96       1
1   2010-01-02     A     95       1
2   2010-01-04     A     87       2
3   2010-01-05     A     28       2
4   2010-01-06     A     35       2
..         ...   ...    ...     ...
825 2010-12-26     C     51      65
826 2010-12-28     C     73      66
827 2010-12-29     C     50      66
828 2010-12-30     C     48      66
829 2010-12-31     C     28      66

[830 rows x 4 columns]

          date group  value  island
0   2010-01-01     A     96       1
1   2010-01-02     A     95       1
2   2010-01-04     A     87       2
3   2010-01-05     A     28       2
4   2010-01-06     A     35       2
..         ...   ...    ...     ...
825 2010-12-26     C     51     205
826 2010-12-28     C     73     206
827 2010-12-29     C     50     206
828 2010-12-30     C     48     206
829 2010-12-31     C     28     206

[830 rows x 4 columns]