I'm trying to create a list of active subscriptions in pandas, but all of my solutions seem too convoluted and most likely very slow, so I was wondering if there is an easier and/or better way to make it do what I want it to do.
My data looks something like this:
User | Subscription Start | Subscription Length | Subscription End |
---|---|---|---|
John | 2023-01-01 | 30 | 2023-01-31 |
John | 2023-01-15 | 30 | 2023-02-14 |
Jack | 2023-02-01 | 120 | 2023-06-01 |
Jack | 2022-08-01 | 30 | 2023-08-31 |
Jeff | 2023-01-01 | 30 | 2023-01-31 |
Users can buy multiple subscriptions and they add together, and users can also let their subscriptions stop and buy it at a later date. Ideally what I would like the output to be is to have one line for each 'subscription instant' for each user, so it would look something like this:
User | Subscription Start | Subscription Length | Subscription End |
---|---|---|---|
John | 2023-01-01 | 60 | 2023-03-02 |
Jack | 2023-02-01 | 120 | 2023-06-01 |
Jack | 2022-08-01 | 30 | 2023-08-31 |
Jeff | 2023-01-01 | 30 | 2023-01-31 |
I tried to this with a for loop, where if a user has not had a subscription then they would be added to the list and if they did have one before it would check if the new subscription's start date falls within any of the subscription ranges already in the list and then adds on top of one of them, but I had way too many for and if statements, and I'm sure there is a more elegant way to do it.
So after some thinking, I managed to come up with this, it works good enough:
li = df.iloc[0:0]
for x in df.index:
if df.iloc[x,0] not in li.User.unique():
li = li.append(df.iloc[x])
else:
if len(li[(li.User == df.iloc[x,0]) & (li.end_date >= df.iloc[x,2]) & (li.start_date <= df.iloc[x,2])]) > 0:
li.loc[(li.clientid == df.iloc[x,0]) & (li.end_date >= df.iloc[x,2]) & (li.start_date <= df.iloc[x,2]),'sub_length'] += df.iloc[x,3]
else:
li = li.append(df.iloc[x])
I've found that this only works perfectly if the data is in chronological order. I've also made a for loop to check through the list for any errors:
li = li.reset_index(drop=True)
log2 = pd.DataFrame(columns=['Outcome','value'])
log2 = log2.iloc[0:0]
li2 = li.iloc[0:0]
for x in li.index:
if len(li[(li.User== li.iloc[x,0]) & (li.end_date >= li.iloc[x,2]) & (li.start_date <= li.iloc[x,2])]) > 1:
li2 = li2.append(li.iloc[x])
log2 = log2.append({'Outcome':'Bad','value':1},ignore_index=True)
else:
log2 = log2.append({'Outcome':'Good','value':1},ignore_index=True)
log2.groupby('Outcome').count()
For this exact code to work the columns should be (in order): User platform start_date sub_length end_date