python · pyspark · etl

transform dataframe: How do you group string columns in PySpark?


I am currently working with the following DataFrame:

    A  B  C     D            E
    1  2  some  null         something A
    1  2  some  something B  null

And I need the following output:

    A  B  C     D            E
    1  2  some  something B  something A

My problem is that I can't manage to do a groupBy over the string columns.

I tried using a self-join and a pivot.


Solution

  • What about something like this?

    from pyspark.sql import functions as F
    
    # Group by the columns whose values are identical across rows,
    # then take the max of every remaining column; F.max ignores
    # nulls, so the single non-null string per group survives.
    cols_for_groupby = ["A", "B", "C"]
    (
        df
        .groupby(cols_for_groupby)
        .agg(*[
            F.max(c).alias(c)
            for c in df.columns if c not in cols_for_groupby
        ])
    )
    

    If df is the DataFrame in the question, the result is:

    +---+---+----+-----------+-----------+
    |  A|  B|   C|          D|          E|
    +---+---+----+-----------+-----------+
    |  1|  2|some|something B|something A|
    +---+---+----+-----------+-----------+
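
  • The reason this works is that the per-group max is computed only over non-null values, so each group collapses to its single non-null string per column. A plain-Python sketch of that aggregation logic, using the question's sample rows (no Spark installation needed to follow along):

    ```python
    from collections import defaultdict

    # Sample rows mirroring the question's DataFrame (None plays the role of null)
    rows = [
        {"A": 1, "B": 2, "C": "some", "D": None, "E": "something A"},
        {"A": 1, "B": 2, "C": "some", "D": "something B", "E": None},
    ]
    group_cols = ["A", "B", "C"]
    agg_cols = ["D", "E"]

    # Bucket rows by their group key, like groupby(cols_for_groupby)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in group_cols)].append(row)

    # For each group, take the max over non-null values, like F.max
    result = []
    for key, members in groups.items():
        out = dict(zip(group_cols, key))
        for c in agg_cols:
            vals = [m[c] for m in members if m[c] is not None]
            out[c] = max(vals) if vals else None
        result.append(out)

    print(result)
    # [{'A': 1, 'B': 2, 'C': 'some', 'D': 'something B', 'E': 'something A'}]
    ```

    The two rows collapse into one, with D and E each holding the group's non-null value, matching the Spark output above.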