Search code examples
pythonmysqlpandassql-order-bygroup-concat

Python, use "order by" inside a "group concat" with pandas DataFrame


I have the following Pandas DataFrame:

product_id     category     number_of_purchase
23             cat1         18
65             cat2         19
66             cat1         4
98             cat1         9
998            cat1         1
798            cat2         8

And from this DataFrame I want to creat this new DataFrame:

category     url
cat1         65&23
cat2         65&8

(for each category I want to retrieve the 2 items having the most number of purchase)

In MySQL I would have done:

select
   category,
   group_concat(product_id order by numbe_of_purchase desc limit2 separator '&')
from my_table
group by category

but I don't know how to group_concat with Pandas DataFrame and also how to use an order by and a limit inside a group_concat.


Solution

  • There's no group concat function in python / pandas, so we'll have to use some groupby. It's a bit longer than SQL, but still relatively short (main part is 3 lines).

    Let's create the dataframe :

    import pandas as pd
    
    data = {'product_id': [23, 65, 66, 98, 998, 798],
            'category': ['cat1', 'cat2', 'cat1', 'cat1', 'cat1', 'cat2'],
            'number_of_purchase': [18,19,4,9,1,8]}
    
    df = pd.DataFrame(data)
    print df
    

    result :

      category  number_of_purchase  product_id
    0     cat1                  18          23
    1     cat2                  19          65
    2     cat1                   4          66
    3     cat1                   9          98
    4     cat1                   1         998
    5     cat2                   8         798
    

    First step : we sort the dataframe by sales :

    df = df.sort(columns='number_of_purchase', ascending=False)
    df
    

    result :

      category  number_of_purchase  product_id
    1     cat2                  19          65
    0     cat1                  18          23
    3     cat1                   9          98
    5     cat2                   8         798
    2     cat1                   4          66
    4     cat1                   1         998
    

    Seconde step : We use a groupby operation.For each category, it will create a list of the top two categories. Data is still integer.

    df = df.groupby('category').apply(lambda x: list(x.product_id)[:2])
    print df
    

    result :

    category
    cat1         [23, 98]
    cat2        [65, 798]
    dtype: object
    

    If you need to have the result as a string, we use a simple lambda operation :

    df.apply(lambda x: '&'.join([str(elem) for elem in x]))
    

    result :

    category
    cat1         23&98
    cat2        65&798
    dtype: object