I have a table with rows being individual visits from users to an online shop. There are multiple columns that are attributes of the visit (could be booleans or categorical with more than 2 possible values). And there is a column that counts how many items were bought during that visit.
I want to create a table which summarises by every attribute-pair the average number of items bought per visit. I.e. for a given attribute pair value, sum up the number of items bought and divide by the number of rows for this attribute pair. I would then use this table to visualise it as a heat map.
This is similar to preparing separate heatmaps for every attribute-pair. But would want to do this in one table or heatmap for easier digestibility.
Note than along the diagonal when the same attributes are being compared the result is always 1, or when mutually exclusive attributes are compared the result is NA. The cells left empty are the same as the ones filled in above the diagonal.
Input table:
visit_id | age | is_website | is_US | items_bought |
---|---|---|---|---|
aa | young | true | true | 0 |
ab | young | false | false | 2 |
ac | old | true | true | 0 |
ad | old | true | false | 3 |
Desired output table:
age young | age old | is_website true | is_website false | is_US true | is_US false | |
---|---|---|---|---|---|---|
age young | 1 | NA | 0 | 2 | 0 | 2 |
age old | NA | 1 | 1.5 | null division | 0 | 3 |
is_website true | 1 | NA | 0 | 3 | ||
is_website false | NA | 1 | null division | 2 | ||
is_US true | 1 | NA | ||||
is_US false | NA | 1 |
What I tried:
data = {
'visit_id': ['aa', 'ab', 'ac', 'ad'],
'age': ['young', 'young', 'old', 'old'],
'is_website': [True, False, True, True],
'is_US': [True, False, True, False],
'items_bought': [0, 2, 0, 3]
}
import pandas as pd
df1 = pd.DataFrame(data)
dim = ['age', 'is_website', 'is_US']
combinations
from itertools
from itertools import combinations
dim_pairs = list(combinations(dim, 2))
total_items_bought
) and count the number of rows (total_visits
)items_bought_per_visit
which is total_items_bought / total_visits
dfs = {}
for x in range(len(dim_pairs)):
grouped = df1.groupby([dim_pairs[x][0], dim_pairs[x][1]]).agg({'items_bought': 'sum', 'visit_id': 'count'}).reset_index()
grouped['items_bought_per_visit'] = grouped['items_bought'] / grouped['visit_id']
pivot_df = grouped.pivot_table(index=dim_pairs[x][0], columns=dim_pairs[x][1], values='items_bought_per_visit', aggfunc='sum').fillna(0)
df_name = f"pivot_df{x}"
dfs[df_name] = pivot_df
So I have a dictionary of dataframes which has all the values I'd need to fill in the single desired output table, but not sure how. Or not sure if there's an easier way.
You could try to first reshape df1
with melt
, then use corr
with a custom function:
tmp = (df1
.melt(df1.columns.difference(dim))
.assign(variable=lambda d: d['variable']+' '+d['value'].astype(str))
.pivot(index='visit_id', columns='variable', values='items_bought')
)
def f(a, b):
m = a == b
return a[m].mean()
out = tmp.corr(f)
Output:
variable age old age young is_US False is_US True is_website False is_website True
variable
age old 1.0 NaN 3.0 0.0 NaN 1.5
age young NaN 1.0 2.0 0.0 2.0 0.0
is_US False 3.0 2.0 1.0 NaN 2.0 3.0
is_US True 0.0 0.0 NaN 1.0 NaN 0.0
is_website False NaN 2.0 2.0 NaN 1.0 NaN
is_website True 1.5 0.0 3.0 0.0 NaN 1.0
Intermediate tmp
:
variable age old age young is_US False is_US True is_website False is_website True
visit_id
aa NaN 0.0 NaN 0.0 NaN 0.0
ab NaN 2.0 2.0 NaN 2.0 NaN
ac 0.0 NaN NaN 0.0 NaN 0.0
ad 3.0 NaN 3.0 NaN NaN 3.0