I have the following Pandas DataFrame:
product_id category number_of_purchase
23 cat1 18
65 cat2 19
66 cat1 4
98 cat1 9
998 cat1 1
798 cat2 8
And from this DataFrame I want to creat this new DataFrame:
category url
cat1 65&23
cat2 65&8
(for each category I want to retrieve the 2 items having the most number of purchase)
In MySQL I would have done:
select
category,
group_concat(product_id order by numbe_of_purchase desc limit2 separator '&')
from my_table
group by category
but I don't know how to group_concat with Pandas DataFrame and also how to use an order by and a limit inside a group_concat.
There's no group concat function in python / pandas, so we'll have to use some groupby. It's a bit longer than SQL, but still relatively short (main part is 3 lines).
Let's create the dataframe :
import pandas as pd
data = {'product_id': [23, 65, 66, 98, 998, 798],
'category': ['cat1', 'cat2', 'cat1', 'cat1', 'cat1', 'cat2'],
'number_of_purchase': [18,19,4,9,1,8]}
df = pd.DataFrame(data)
print df
result :
category number_of_purchase product_id
0 cat1 18 23
1 cat2 19 65
2 cat1 4 66
3 cat1 9 98
4 cat1 1 998
5 cat2 8 798
First step : we sort the dataframe by sales :
df = df.sort(columns='number_of_purchase', ascending=False)
df
result :
category number_of_purchase product_id
1 cat2 19 65
0 cat1 18 23
3 cat1 9 98
5 cat2 8 798
2 cat1 4 66
4 cat1 1 998
Seconde step : We use a groupby operation.For each category, it will create a list of the top two categories. Data is still integer.
df = df.groupby('category').apply(lambda x: list(x.product_id)[:2])
print df
result :
category
cat1 [23, 98]
cat2 [65, 798]
dtype: object
If you need to have the result as a string, we use a simple lambda operation :
df.apply(lambda x: '&'.join([str(elem) for elem in x]))
result :
category
cat1 23&98
cat2 65&798
dtype: object