Input format:
CUSTOMER MONTH ISSUE
1 M1 ABC
1 M1 DEF
1 M2 ABC
1 M3 QRS
2 M1 PQR
2 M2 PQR
2 M2 ABC
2 M3 DEF
Desired output format:
CUSTOMER M1 M2 M3
1 ABC ABC QRS
1 DEF ABC QRS
2 PQR PQR DEF
2 PQR ABC DEF
I want to find out the possible distinct combinations at customer level and eventually sum of to get a count of such patterns across customers.
I can achieve the above using SQL joins on the same table, but the data is very huge & the number of combinations is very high. So it isn't efficient.
Pivot in Python & SQL wouldn't consider these duplicates. Is there any other solution I could try?
Thank you
I can achieve the above using SQL joins on the same table, but the data is very huge & the number of combinations is very high. So it isn't efficient.
Pivot in Python & SQL wouldn't consider these duplicates. Pivot error in Python: ValueError: Index contains duplicate entries, cannot reshape
In python you can achieve it using underneath segment a sample. You need to pass your data to the dataframe and will need to tweek the pivot rows and groupby column
import pandas as pd
data = {
'CUSTOMER': [1, 2, 2],
'MONTH': ['M1', 'M1', 'M2'],
'ISSUE': ['ABC', 'DEF', 'ABC']
}
df = pd.DataFrame(data)
df['issue_count'] = df.groupby(['CUSTOMER', 'MONTH']).cumcount()
pivot_df = df.pivot_table(index=['CUSTOMER', 'issue_count'], columns='MONTH', values='ISSUE', aggfunc='first').reset_index()
pivot_df = pivot_df.drop('issue_count', axis=1).fillna('')#droping the enumeration
pivot_df = pivot_df[['CUSTOMER', 'M1', 'M2']]
print(pivot_df)
print("\n\n")
pattern_counts = pivot_df.groupby(['M1', 'M2']).size().reset_index(name='Count')
print(pattern_counts)
Output:
MONTH CUSTOMER M1 M2
0 1 ABC
1 2 DEF ABC
M1 M2 Count
0 ABC 1
1 DEF ABC 1
To get complete list of combination
import pandas as pd
from itertools import product
data = {
'CUSTOMER': [1, 2, 2],
'MONTH': ['M1', 'M1', 'M2'],
'ISSUE': ['ABC', 'DEF', 'ABC']
}
df = pd.DataFrame(data)
unique_issues = {month: df[df['MONTH'] == month]['ISSUE'].unique() for month in df['MONTH'].unique()}
all_combinations = []
for customer in df['CUSTOMER'].unique():
for combination in product(*[unique_issues[month] for month in unique_issues]):
combo_dict = {'CUSTOMER': customer}
combo_dict.update({f'M{i+1}': issue for i, issue in enumerate(combination)})
all_combinations.append(combo_dict)
all_combinations_df = pd.DataFrame(all_combinations)
pattern_counts = all_combinations_df.groupby(list(unique_issues.keys())).size().reset_index(name='Count')
print(all_combinations_df)
print("\n\n")
print(pattern_counts)
Output for above:
CUSTOMER M1 M2
0 1 ABC ABC
1 1 DEF ABC
2 2 ABC ABC
3 2 DEF ABC
M1 M2 Count
0 ABC ABC 2
1 DEF ABC 2