Tags: apache-spark, pyspark, aws-glue

AWS Glue does not give a consistent result for PySpark orderBy


When running PySpark locally I get correct results, with each list ordered by BOOK_ID. But when the job is deployed on AWS Glue, the books come out unordered. The expected result schema and the code:

    root
     |-- AUTHOR_ID: integer
     |-- NAME: string
     |-- BOOK_LIST: array
     |    |-- BOOK_ID: integer
     |    |-- BOOK_NAME: string

    from pyspark.sql import functions as F

    # Sort by BOOK_ID descending, then collect each author's books,
    # expecting the resulting lists to keep that order
    result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
              .orderBy(F.col("BOOK_ID").desc())
              .groupBy("AUTHOR_ID", "NAME")
              .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
              )
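
For reference, a minimal local setup for df_authors and df_books; the sample rows below are invented, not from the original question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Invented sample data matching the schemas used above
    df_authors = spark.createDataFrame(
        [(1, "Alice"), (2, "Bob")],
        ["AUTHOR_ID", "NAME"],
    )
    df_books = spark.createDataFrame(
        [(1, 10, "Book A"), (1, 20, "Book B"), (2, 30, "Book C")],
        ["AUTHOR_ID", "BOOK_ID", "BOOK_NAME"],
    )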

Note: I'm using PySpark 3.2.1 locally and Glue 2.0.

Any suggestions, please?


Solution

  • Supposition

    Although I managed to run the job on Glue 3.0, which supports Spark 3.1, the orderBy still gives a wrong result.

    See: Migrating from AWS Glue 2.0 to AWS Glue 3.0

    The workaround that seems to give a correct result is to reduce the number of workers to 2, which is the minimum allowed number of workers.

    The explanation: a Glue job runs with many workers in parallel, so the data is spread over several partitions. The orderBy does sort the rows, but the shuffle triggered by the following groupBy redistributes them, and collect_list gives no ordering guarantee after a shuffle. With a single worker (a single partition) the order happens to survive.
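
    One way to check this (a sketch): inspect how many partitions carry the sorted rows. Locally everything may sit in one partition, while on a Glue cluster it is spread over many, and the groupBy shuffle then merges them in no guaranteed order.

        # Sketch: count the partitions holding the sorted rows.
        # With one partition the collected order survives; with several,
        # the groupBy shuffle merges them in arbitrary order.
        sorted_df = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
                     .orderBy(F.col("BOOK_ID").desc()))
        print(sorted_df.rdd.getNumPartitions())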

    Suggested Solutions

    • Use the minimum number of workers (not a viable solution)
    • Apply the .orderBy to each DataFrame before the join (see the first sketch at the end of this answer)
    • Or use .coalesce(1) to bring everything onto a single partition first:
        result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
                  .coalesce(1)  # one partition, so the sort order survives the groupBy
                  .orderBy(F.col("BOOK_ID").desc())
                  .groupBy("AUTHOR_ID", "NAME")
                  .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
                  )
    

    This gives the right result, but we lose performance: the whole dataset is processed on a single partition.
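
    A sketch of the second suggestion, sorting before the join. Note that Spark does not formally guarantee that row order survives a join or a groupBy, so this stays fragile:

        # Sort the books before joining (order preservation through the
        # join and groupBy is not guaranteed by Spark, so treat with care)
        df_books_sorted = df_books.orderBy(F.col("BOOK_ID").desc())

        result = (df_authors.join(df_books_sorted, on=["AUTHOR_ID"], how="left")
                  .groupBy("AUTHOR_ID", "NAME")
                  .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")))
                  )

    An alternative worth considering (not part of the original answer): collect the list first, then sort it inside each row with F.sort_array. sort_array compares structs field by field, so putting BOOK_ID first in the struct sorts by BOOK_ID, and asc=False makes the order descending. This keeps the job fully parallel:

        # Collect unordered, then sort each author's array by BOOK_ID descending
        result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
                  .groupBy("AUTHOR_ID", "NAME")
                  .agg(F.sort_array(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")),
                                    asc=False).alias("BOOK_LIST"))
                  )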