Search code examples
pythonpandasgaps-and-islands

Gaps and Islands Python dataframe pandas


I have the following problem which as I read, it's mentioned as the 'gaps and islands' problem. A sample of my data are like the below (I excluded some columns related to the user also, but I suppose the approach would be the same):

user reports_to date
john mark 1/1/2021
john mark 2/1/2021
john mark 3/1/2021
john mark 4/1/2021
john mark 5/1/2021
john mark 6/1/2021
john mark 7/1/2021
john mark 8/1/2021
john mark 9/1/2021
john mark 10/1/2021
john mark 11/1/2021
john nick 12/1/2021
john nick 13/1/2021
john nick 14/1/2021
john nick 15/1/2021
john nick 16/1/2021
john nick 17/1/2021
john nick 18/1/2021
john nick 19/1/2021
john nick 20/1/2021
john mark 21/1/2021
john mark 22/1/2021
john mark 23/1/2021
john mark 24/1/2021
john mark 25/1/2021

My dataset contains thousands of rows and couple more columns that are related to the characteristics of the user. That pattern may exist in multiple users.

When I try to find the min and max date that a user had that manager using a simple group by, I get the following result:

user reports_to from to
john mark 1/1/2021 25/1/2021
john nick 12/1/2021 20/1/2021

Logically, that result is wrong. The desired result would be the following:

user reports_to from to
john mark 1/1/2021 11/1/2021
john nick 12/1/2021 20/1/2021
john mark 21/1/2021 25/1/2021

My project is on python and I try to solve it using window_functions or rank functions like the following

df["rank"] = df.groupby(['user','reports_to'])["date"].rank("dense", ascending=True)

in order to compare the rank vs the previous date, but I still cannot solve that problem. Which approach would be the best and thank you in advance for your time.


Solution

  • The code is here.

    import pandas as pd
    from io import StringIO
    df = '''
    user    reports_to  date
    john    mark    1/1/2021
    john    mark    2/1/2021
    john    mark    3/1/2021
    john    mark    4/1/2021
    john    mark    5/1/2021
    john    mark    6/1/2021
    john    mark    7/1/2021
    john    mark    8/1/2021
    john    mark    9/1/2021
    john    mark    10/1/2021
    john    mark    11/1/2021
    john    nick    12/1/2021
    john    nick    13/1/2021
    john    nick    14/1/2021
    john    nick    15/1/2021
    john    nick    16/1/2021
    john    nick    17/1/2021
    john    nick    18/1/2021
    john    nick    19/1/2021
    john    nick    20/1/2021
    john    mark    21/1/2021
    john    mark    22/1/2021
    john    mark    23/1/2021
    john    mark    24/1/2021
    john    mark    25/1/2021
    '''
    inf=StringIO(df)
    df = pd.read_csv(inf,sep="\t")
    
    
    df['group'] = (( (df.user != df.user.shift() ) | (df.reports_to != df.reports_to.shift())  ).cumsum())
    
    # must do this 
    df['date'] = pd.to_datetime(df['date']) 
    
    result = ( df.groupby(['user','group','reports_to'])['date']
        .agg({min, max})
        .rename(columns={'min': 'from','max':'to'})
        .reset_index() )