
GROUP BY rows in DataFrame/table in Pandas/iPython notebook (Jupyter)?


I have a table which looks like this:

Current DataFrame/Table: 'original_table'

   col_1          col_2             col_3
0  Non-Saved      www.google.com    10
1  Non-Saved      www.facebook.com  20
2  Non-Saved      www.linkedin.com  20
3  Saved          www.Quora.com     30
4  Saved          www.gmail.com     40

Can I derive a table such as the following using an SQL query?

    col_1          col_2             col_3
 0  Non-Saved      www.google.com    50
                   www.facebook.com
                   www.linkedin.com
 1  Saved          www.Quora.com     70
                   www.gmail.com    

Basically, I am expecting a table that has the DISTINCT values from col_1, all corresponding values from col_2, and SUM(col_3) for each group, in one row.

My query:

 sql("""SELECT col_1, group_concat(col_2) as col_2, SUM(col_3) as col_3
 FROM original_table
 GROUP BY col_1
 """).show()

I tried an embedded SQL query using group_concat, but it gives me the following error:

AnalysisException: u"Undefined function: 'GROUP_CONCAT'. This function 
is neither a registered temporary function nor a permanent function 
registered in the database 'default'.

Can anyone please help me with probably a simple hack?
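Since GROUP_CONCAT is not registered in Spark SQL, one pandas-side sketch of the same aggregation (assuming the sample data above, and using plain `groupby`/`agg` with `','.join` to stand in for group_concat) would be:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'col_1': ['Non-Saved', 'Non-Saved', 'Non-Saved', 'Saved', 'Saved'],
    'col_2': ['www.google.com', 'www.facebook.com', 'www.linkedin.com',
              'www.Quora.com', 'www.gmail.com'],
    'col_3': [10, 20, 20, 30, 40],
})

# One row per group: join the URLs, sum the numbers
out = (df.groupby('col_1', as_index=False)
         .agg({'col_2': ','.join, 'col_3': 'sum'}))
print(out)
```

This collapses the table to one row per distinct col_1 value, which is the shape the question asks for (the answer below keeps the original row count instead).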


Solution

  • Use GroupBy.transform - it returns a Series the same size as the original DataFrame:

    # first convert the column to numeric
    df['col_3'] = df['col_3'].astype(int)  # use float if the values have decimals
    
    df['col_3'] = df.groupby('col_1')['col_3'].transform('sum')
    print (df)
           col_1             col_2  col_3
    0  Non-Saved    www.google.com     50
    1  Non-Saved  www.facebook.com     50
    2  Non-Saved  www.linkedin.com     50
    3      Saved     www.Quora.com     70
    4      Saved     www.gmail.com     70
    

    And if you need the value only in the first row of each group:

    df[['col_1','col_3']] = df[['col_1','col_3']].mask(df['col_1'].duplicated())
    print (df)
           col_1             col_2  col_3
    0  Non-Saved    www.google.com   50.0
    1        NaN  www.facebook.com    NaN
    2        NaN  www.linkedin.com    NaN
    3      Saved     www.Quora.com   70.0
    4        NaN     www.gmail.com    NaN
    
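The mask works because Series.duplicated() flags every repeat after the first occurrence within the column; a minimal check on the grouping column:

```python
import pandas as pd

# Same col_1 values as in the example above
s = pd.Series(['Non-Saved', 'Non-Saved', 'Non-Saved', 'Saved', 'Saved'])
# keep='first' is the default: the first occurrence is not marked
print(s.duplicated().tolist())
# [False, True, True, False, True]
```

Rows marked True are exactly the ones that get blanked out by `mask`.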

    If necessary, the NaNs can be replaced by empty strings. In the first (string) column that is no problem, but the last column ends up with mixed types - strings alongside numbers - and some functions will then fail:

    df[['col_1','col_3']] = df[['col_1','col_3']].mask(df['col_1'].duplicated()).fillna('')
    print (df)
           col_1             col_2 col_3
    0  Non-Saved    www.google.com    50
    1             www.facebook.com      
    2             www.linkedin.com      
    3      Saved     www.Quora.com    70
    4                www.gmail.com      
    

    print (df['col_3'].apply(type))
    0    <class 'float'>
    1      <class 'str'>
    2      <class 'str'>
    3    <class 'float'>
    4      <class 'str'>
    Name: col_3, dtype: object
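One way to avoid the mixed-type column entirely (a sketch, assuming the result is headed for a file or printed report) is to keep the NaNs in memory, so col_3 stays numeric, and blank them only on export via the `na_rep` parameter:

```python
import pandas as pd

# The masked result from above: NaN marks the repeated rows
df = pd.DataFrame({
    'col_1': ['Non-Saved', None, None, 'Saved', None],
    'col_2': ['www.google.com', 'www.facebook.com', 'www.linkedin.com',
              'www.Quora.com', 'www.gmail.com'],
    'col_3': [50.0, None, None, 70.0, None],
})

# col_3 is still numeric in memory, so aggregations keep working
print(df['col_3'].dtype)   # float64

# Blanks appear only in the exported text, not in the DataFrame
csv_text = df.to_csv(index=False, na_rep='')
print(csv_text)
```

This keeps the in-memory DataFrame fully usable while the output still shows empty cells.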