
Groupby in sparkR not giving desired results


I have created a DataFrame from mtcars. I group by gear and cyl, then calculate the max of hp and disp. Something is going wrong in the group by: there should be 8 groups, but I get only 6.

library(SparkR)
xx <- as.DataFrame(sqlContext, data = mtcars)

head(agg(groupBy(xx, "gear", "cyl"), hp = 'max'))
  gear cyl max(hp)
1    3   8     245
2    5   4     113
3    3   4      97
4    4   4     109
5    5   6     175
6    3   6     110
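
One way to check whether the missing groups are really absent or only hidden by the display is to count the rows of the aggregated DataFrame before looking at it. A minimal sketch, assuming the same xx DataFrame as above:

gp <- agg(groupBy(xx, "gear", "cyl"), hp = 'max')
# count() returns the number of rows in a SparkR DataFrame;
# if it prints 8, every gear/cyl combination is present and only
# the printed preview was truncated.
count(gp)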

Update 1:

I have another query. In the documentation of groupBy there is an example:

## Examples

## Not run: 
  # Compute the average for all numeric columns grouped by department.
  avg(groupBy(df, "department"))

  # Compute the max age and average salary, grouped by department and gender.
  agg(groupBy(df, "department", "gender"), salary="avg", "age" -> "max")

## End(Not run)

Similarly, for mtcars I came up with:

agg(groupBy(xx, "gear", "cyl"), qsec ="avg", "disp" -> "max")

Firstly, my understanding is that this should give the max of disp, but the code doesn't work; it produces the error below. Secondly, the code does work with an = in place of ->. So is there a typo in the documentation, or something else going on?

unable to find an inherited method for function ‘groupBy’ for signature ‘"function"’

My SparkR version is SparkR_1.6.1.
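
For reference, this is the = form that does run for me (a sketch against the same xx DataFrame; I note that -> is R's right-assignment operator rather than argument syntax, which may be why the documented example fails):

# Each argument is column = "aggregate function name".
agg(groupBy(xx, "gear", "cyl"), qsec = "avg", disp = "max")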


Solution

  • Your aggregation is fine, but you are wrapping it in head(), which only shows the first 6 rows. You need to replace it with collect(), like this:

    df <- as.DataFrame(mtcars)
    gp <- agg(groupBy(df, df$gear, df$cyl), hp = 'max')
    collect(gp)
    

    Just a remark: I am using Spark 2.0.2.
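
    As a side note, if you just want a quick look without collecting everything back to the driver, head() and showDF() can display more rows. A small sketch, assuming the same gp as above:

    head(gp, num = 8)   # head() defaults to 6 rows; ask for 8 explicitly
    showDF(gp)          # prints up to 20 rows without collecting into R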