Search code examples
pythonpandasgroup-bygaps-and-islands

How to use Pandas to solve Gap and Island problem with dates?


I am hoping someone out there can help. I am trying to figure out how to solve gap and island problems with pandas, but have been unsuccessful. For a sample data set I am using:

np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
df = pd.DataFrame({'date': dates,
                   'group': 'A',
                   'value': np.random.randint(0, 100, size=len(dates))
                  }).append(pd.DataFrame({'date': dates,
                                          'group': 'B',
                                          'value': np.random.randint(0, 100, size=len(dates))
                  })).append(pd.DataFrame({'date': dates,
                                           'group': 'C',
                                           'value': np.random.randint(0, 100, size=len(dates))
                                          })).reset_index(drop=True)
length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df
    date    group   value
0   2010-01-01  A   57
1   2010-01-02  A   11
2   2010-01-03  A   83
3   2010-01-04  A   83
4   2010-01-05  A   93
... ... ... ...
992 2010-12-27  C   50
993 2010-12-28  C   59
994 2010-12-29  C   85
995 2010-12-30  C   32
996 2010-12-31  C   3

I would like to identify islands with breaks in dates of > 1 day to start start a new island.

Any help is appreciated!

Not sure where to start with this one.


Solution

  • My assumption is that you want to identify the islands by group. I see two versions to do that: (1) You number the islands groupwise, i.e. an island number can occur several times. (2) You number overall. Therefore 2 proposals:

    groupwise:

    df = (
        df.sort_values(["group", "date"])
        .assign(island=lambda df:
            df.groupby("group")["date"].diff().dt.days.fillna(2).gt(1)
            .groupby(df["group"]).cumsum())
    )
    

    overall:

    df = (
        df.sort_values(["group", "date"])
        .assign(island=lambda df:
            df.groupby("group")["date"].diff().dt.days.fillna(2).gt(1).cumsum())
    )
    

    After sorting along group and date (not necessary for the example) group by group and take the diffs over date: this results in timedeltas. Then extract the days, fill the NaNs at the beginning of the groups with 2, and check where the number of days is greater than 1: this results in booleans. Now use .cumsum, according to the counting preferences, to indentify the islands.

    Results for the following sample dataframe

    rng = np.random.default_rng(12345678)
    dates = pd.date_range(start="2010-01-01", end="2010-12-31", freq="D")
    dfs = (pd.DataFrame({"date": dates, "group": group,
                         "value": rng.integers(0, 100, len(dates))})
           for group in "ABC")
    df = pd.concat(dfs, ignore_index=True)
    df = df.drop(rng.integers(0, df.shape[0], 300)).reset_index(drop=True)
    

    looks like

              date group  value  island
    0   2010-01-01     A     96       1
    1   2010-01-02     A     95       1
    2   2010-01-04     A     87       2
    3   2010-01-05     A     28       2
    4   2010-01-06     A     35       2
    ..         ...   ...    ...     ...
    825 2010-12-26     C     51      65
    826 2010-12-28     C     73      66
    827 2010-12-29     C     50      66
    828 2010-12-30     C     48      66
    829 2010-12-31     C     28      66
    
    [830 rows x 4 columns]
    

    or

              date group  value  island
    0   2010-01-01     A     96       1
    1   2010-01-02     A     95       1
    2   2010-01-04     A     87       2
    3   2010-01-05     A     28       2
    4   2010-01-06     A     35       2
    ..         ...   ...    ...     ...
    825 2010-12-26     C     51     205
    826 2010-12-28     C     73     206
    827 2010-12-29     C     50     206
    828 2010-12-30     C     48     206
    829 2010-12-31     C     28     206
    
    [830 rows x 4 columns]