I would kindly ask you to help me to clarify a doubt about Data Analysis with Python
.
I am working on a DataFrame
that displays sales data from a fictional company. In order to identify which pairs of products are most often sold together, I created a column named 'Grouped', which groups the products that were registered with the same 'Order ID' (each item is separated by a comma).
After that, I used Counter()
from collections
and combinations
from itertools
to count how many times each pair appeared in the 'Grouped' column. Please, check the following code:
from itertools import combinations
from collections import Counter
count_pairs=Counter()
for row in df['Grouped']:
row_list=row.split(',')
count_pairs.update(Counter(combinations(row_list,2))) # I want to count pairs of two
count_pairs.most_common(5)
This is the output that was generated:
[(('iPhone', 'Lightning Charging Cable'), 1004),
(('Google Phone', 'USB-C Charging Cable'), 987),
(('iPhone', 'Wired Headphones'), 447),
(('Google Phone', 'Wired Headphones'), 414),
(('Vareebadd Phone', 'USB-C Charging Cable'), 361)]
Now, I want to identify what are the products that appear the most in this list, that is, how many times each item appears in the available pairs.
I tried to analyze them individually with the following code:
['Lightning Charging Cable' in x for x in count_pairs].count(True) #output = 37
However, I want to create "a list or a series" and sort the results by the number of times each item appears in the available pairs.
Do you know a way to solve this?
Thank you very much for your help!
Additional info:
As some of you used the top-five pairs (count_pairs.most_common(5)) to solve this problem, I would like to inform that I will need to count the elements based on a 'Counter object', which is 'count_pairs':
count_pairs
The summarized output of 'count_pairs' is:
Counter({('Google Phone', 'Wired Headphones'): 414,
('Google Phone', 'USB-C Charging Cable'): 987,
('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27,
('AAA Batteries (4-pack)', 'Google Phone'): 11,
('Lightning Charging Cable', 'USB-C Charging Cable'): 58,....})
#the original output has 313 pairs
Here's how I would do this:
from collections import Counter
import pandas as pd
data = [(('iPhone', 'Lightning Charging Cable'), 1004),
(('Google Phone', 'USB-C Charging Cable'), 987),
(('iPhone', 'Wired Headphones'), 447),
(('Google Phone', 'Wired Headphones'), 414),
(('Vareebadd Phone', 'USB-C Charging Cable'), 361)]
df = pd.DataFrame(sorted(Counter([x[0][1] for x in data]).items(), key= lambda x : x[1]))
To apply the same transformation to the original count_pairs
data structure, just remove the [0]
from x[0][1]
, like this:
df = pd.DataFrame(sorted(Counter([x[1] for x in count_pairs]).items(), key= lambda x : x[1]))
The effective difference between the two structures is that data
has an extra outer list
that doesn't exist in count_pairs
.