I have a table which looks like this:
Current DataFrame/Table: 'original_table'
col_1 col_2 col_3
0 Non-Saved www.google.com 10
1 Non-Saved www.facebook.com 20
2 Non-Saved www.linkedin.com 20
3 Saved www.Quora.com 30
4 Saved www.gmail.com 40
Can I derive a table such as the following using a SQL query?
col_1 col_2 col_3
0 Non-Saved www.google.com 50
www.facebook.com
www.linkedin.com
1 Saved www.Quora.com 70
www.gmail.com
Basically, I expect one row per DISTINCT value of col_1, holding all corresponding values from col_2 and SUM(col_3) over the corresponding rows.
My query:
sql("""SELECT col_1, group_concat(col_2) as col_2, SUM(col_3) as col_3
FROM original_table
GROUP BY col_1
""").show()
I tried an embedded SQL query with group_concat, but it gives me the following error:
AnalysisException: u"Undefined function: 'GROUP_CONCAT'. This function
is neither a registered temporary function nor a permanent function
registered in the database 'default'.
Can anyone please help me with probably a simple hack?
Use GroupBy.transform, which returns a Series of the same size as the original DataFrame, with each group's result repeated on every row of that group:
# first convert the column to numeric
df['col_3'] = df['col_3'].astype(int)  # or float
df['col_3'] = df.groupby('col_1')['col_3'].transform('sum')
print (df)
col_1 col_2 col_3
0 Non-Saved www.google.com 50
1 Non-Saved www.facebook.com 50
2 Non-Saved www.linkedin.com 50
3 Saved www.Quora.com 70
4 Saved www.gmail.com 70
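For anyone who wants to run this step in isolation, here is a minimal self-contained sketch. It assumes the data is already in a pandas DataFrame (the question itself uses Spark SQL) and that col_3 starts out as strings, which is why the conversion step is needed. The name group_totals is only illustrative.

```python
import pandas as pd

# Sample data copied from the question; col_3 as strings is an assumption
# made here to show why the numeric conversion comes first.
df = pd.DataFrame({
    'col_1': ['Non-Saved', 'Non-Saved', 'Non-Saved', 'Saved', 'Saved'],
    'col_2': ['www.google.com', 'www.facebook.com', 'www.linkedin.com',
              'www.Quora.com', 'www.gmail.com'],
    'col_3': ['10', '20', '20', '30', '40'],
})

# Convert to numeric so that 'sum' adds numbers instead of concatenating strings.
df['col_3'] = df['col_3'].astype(int)

# transform('sum') computes one total per group and broadcasts it back
# to every row of that group, so the result aligns with df's index.
group_totals = df.groupby('col_1')['col_3'].transform('sum')
print(group_totals.tolist())
```

Because the result has the same index as df, it can be assigned straight back to a column, as in the answer above.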
And if you need the values only in the first row of each group:
df[['col_1','col_3']] = df[['col_1','col_3']].mask(df['col_1'].duplicated())
print (df)
col_1 col_2 col_3
0 Non-Saved www.google.com 50.0
1 NaN www.facebook.com NaN
2 NaN www.linkedin.com NaN
3 Saved www.Quora.com 70.0
4 NaN www.gmail.com NaN
If necessary, the NaNs can be replaced by empty strings. This is no problem in the first (string) column, but the last column then holds mixed types (strings and numbers), so some functions will fail on it:
df[['col_1','col_3']] = df[['col_1','col_3']].mask(df['col_1'].duplicated()).fillna('')
print (df)
col_1 col_2 col_3
0 Non-Saved www.google.com 50
1 www.facebook.com
2 www.linkedin.com
3 Saved www.Quora.com 70
4 www.gmail.com
print (df['col_3'].apply(type))
0 <class 'float'>
1 <class 'str'>
2 <class 'str'>
3 <class 'float'>
4 <class 'str'>
Name: col_3, dtype: object
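If a single row per col_1 value is acceptable, which is closer to what group_concat would give, the mixed-type problem can be avoided entirely. The following is only a sketch under that assumption, again in pandas rather than Spark SQL; the separator ', ' and the name summary are illustrative choices, not something from the question.

```python
import pandas as pd

# Sample data copied from the question, with col_3 already numeric.
df = pd.DataFrame({
    'col_1': ['Non-Saved', 'Non-Saved', 'Non-Saved', 'Saved', 'Saved'],
    'col_2': ['www.google.com', 'www.facebook.com', 'www.linkedin.com',
              'www.Quora.com', 'www.gmail.com'],
    'col_3': [10, 20, 20, 30, 40],
})

# One row per col_1 value: join the col_2 strings and sum col_3.
# Named aggregation keeps the original column names in the result.
summary = (
    df.groupby('col_1', as_index=False)
      .agg(col_2=('col_2', ', '.join), col_3=('col_3', 'sum'))
)
print(summary)
```

Here col_3 stays an integer column and col_2 is a plain string per group, so later numeric operations on col_3 keep working.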