I have the following problem which as I read, it's mentioned as the 'gaps and islands' problem. A sample of my data are like the below (I excluded some columns related to the user also, but I suppose the approach would be the same):
user | reports_to | date |
---|---|---|
john | mark | 1/1/2021 |
john | mark | 2/1/2021 |
john | mark | 3/1/2021 |
john | mark | 4/1/2021 |
john | mark | 5/1/2021 |
john | mark | 6/1/2021 |
john | mark | 7/1/2021 |
john | mark | 8/1/2021 |
john | mark | 9/1/2021 |
john | mark | 10/1/2021 |
john | mark | 11/1/2021 |
john | nick | 12/1/2021 |
john | nick | 13/1/2021 |
john | nick | 14/1/2021 |
john | nick | 15/1/2021 |
john | nick | 16/1/2021 |
john | nick | 17/1/2021 |
john | nick | 18/1/2021 |
john | nick | 19/1/2021 |
john | nick | 20/1/2021 |
john | mark | 21/1/2021 |
john | mark | 22/1/2021 |
john | mark | 23/1/2021 |
john | mark | 24/1/2021 |
john | mark | 25/1/2021 |
My dataset contains thousands of rows and couple more columns that are related to the characteristics of the user. That pattern may exist in multiple users.
When I try to find the min and max date that a user had that manager using a simple group by, I get the following result:
user | reports_to | from | to |
---|---|---|---|
john | mark | 1/1/2021 | 25/1/2021 |
john | nick | 12/1/2021 | 20/1/2021 |
Logically, that result is wrong. The desired result would be the following:
user | reports_to | from | to |
---|---|---|---|
john | mark | 1/1/2021 | 11/1/2021 |
john | nick | 12/1/2021 | 20/1/2021 |
john | mark | 21/1/2021 | 25/1/2021 |
My project is on python and I try to solve it using window_functions or rank functions like the following
df["rank"] = df.groupby(['user','reports_to'])["date"].rank("dense", ascending=True)
in order to compare the rank vs the previous date, but I still cannot solve that problem. Which approach would be the best and thank you in advance for your time.
The code is here.
import pandas as pd
from io import StringIO
df = '''
user reports_to date
john mark 1/1/2021
john mark 2/1/2021
john mark 3/1/2021
john mark 4/1/2021
john mark 5/1/2021
john mark 6/1/2021
john mark 7/1/2021
john mark 8/1/2021
john mark 9/1/2021
john mark 10/1/2021
john mark 11/1/2021
john nick 12/1/2021
john nick 13/1/2021
john nick 14/1/2021
john nick 15/1/2021
john nick 16/1/2021
john nick 17/1/2021
john nick 18/1/2021
john nick 19/1/2021
john nick 20/1/2021
john mark 21/1/2021
john mark 22/1/2021
john mark 23/1/2021
john mark 24/1/2021
john mark 25/1/2021
'''
inf=StringIO(df)
df = pd.read_csv(inf,sep="\t")
df['group'] = (( (df.user != df.user.shift() ) | (df.reports_to != df.reports_to.shift()) ).cumsum())
# must do this
df['date'] = pd.to_datetime(df['date'])
result = ( df.groupby(['user','group','reports_to'])['date']
.agg({min, max})
.rename(columns={'min': 'from','max':'to'})
.reset_index() )