
pyspark collect_set or collect_list with groupby


How can I use collect_set or collect_list on a DataFrame after a groupby? For example, df.groupby('key').collect_set('values') fails with: AttributeError: 'GroupedData' object has no attribute 'collect_set'


Solution

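  • collect_set and collect_list are aggregate functions in pyspark.sql.functions, not methods of GroupedData, so you need to pass them to agg. Example: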

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F
    
    sc = SparkContext("local")
    
    sqlContext = HiveContext(sc)
    
    df = sqlContext.createDataFrame([
        ("a", None, None),
        ("a", "code1", None),
        ("a", "code2", "name2"),
    ], ["id", "code", "name"])
    
    df.show()
    
    +---+-----+-----+
    | id| code| name|
    +---+-----+-----+
    |  a| null| null|
    |  a|code1| null|
    |  a|code2|name2|
    +---+-----+-----+
    

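    Note that the example above creates a HiveContext, which Spark 1.x needs for collect_set and collect_list; on Spark 2.0 and later a plain SparkSession should be enough. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.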

    (df
      .groupby("id")
      .agg(F.collect_set("code"),
           F.collect_list("name"))
      .show())
    
    +---+-----------------+------------------+
    | id|collect_set(code)|collect_list(name)|
    +---+-----------------+------------------+
    |  a|   [code1, code2]|           [name2]|
    +---+-----------------+------------------+
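
The result above has two details that are easy to miss: both functions skip nulls, and collect_set also drops duplicates. As a rough illustration, here is a plain-Python model of that behaviour using the same rows. The helper name collect_by_key is hypothetical and is not part of PySpark. The model keeps values in first-seen order, whereas Spark makes no ordering guarantee for collect_set.

```python
from collections import defaultdict

# Same rows as the example DataFrame: (id, code, name); None stands in for null.
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]


def collect_by_key(rows, key_idx, value_idx, distinct=False):
    """Plain-Python model of groupby(key).agg(collect_list/collect_set(value)).

    Hypothetical helper for illustration only; it is not a PySpark API.
    """
    groups = defaultdict(list)
    for row in rows:
        bucket = groups[row[key_idx]]  # touching the key keeps all-null groups
        value = row[value_idx]
        if value is None:
            continue  # nulls are skipped, matching the output above
        if distinct and value in bucket:
            continue  # collect_set-style de-duplication
        bucket.append(value)
    return dict(groups)


codes = collect_by_key(rows, 0, 1, distinct=True)  # models F.collect_set("code")
names = collect_by_key(rows, 0, 2)                 # models F.collect_list("name")
print(codes)  # {'a': ['code1', 'code2']}
print(names)  # {'a': ['name2']}
```

A group whose values are all null still appears in the result, with an empty list, which is why the model creates the bucket before checking the value.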