When running PySpark locally I get correct results, with the list ordered by BOOK_ID. But when the job is deployed to AWS Glue, the books no longer seem to be ordered.
The expected schema of the result:
root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- element: struct
 |    |    |-- BOOK_ID: integer
 |    |    |-- BOOK_NAME: string
from pyspark.sql import functions as F

result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .orderBy(F.col("BOOK_ID").desc())  # sort the joined rows by BOOK_ID, descending
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)
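For context, here is a minimal sketch of input DataFrames matching the column names above. The sample values are made up, purely so the problem can be reproduced:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; only the column names come from the question
df_authors = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")],
    ["AUTHOR_ID", "NAME"],
)
df_books = spark.createDataFrame(
    [(1, 10, "Book A"), (1, 11, "Book B"), (2, 12, "Book C")],
    ["AUTHOR_ID", "BOOK_ID", "BOOK_NAME"],
)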
Note: I'm using PySpark 3.2.1 locally and Glue 2.0.
Any suggestions, please?
Although I managed to run the job on Glue 3.0, which supports Spark 3.1, orderBy still gives the wrong result.
See: Migrating from AWS Glue 2.0 to AWS Glue 3.0
The workaround that seems to give a correct result is to reduce the number of workers to 2, which is the minimum allowed.
The explanation: Glue jobs run on many workers in parallel, so the joined data is spread across several partitions. orderBy performs a global sort, but the groupBy that follows triggers another shuffle, and collect_list gathers each author's rows in whatever order the shuffle delivers them. The sorted order is only preserved when all the data sits in a single partition, as it effectively does with a single worker.
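As a quick sanity check (my addition, not from the original answer), you can print how many partitions the joined data occupies; anything above 1 means the order is at risk:

joined = df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
# A join involves a shuffle, so this typically prints
# spark.sql.shuffle.partitions (200 by default)
print(joined.rdd.getNumPartitions())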
Another option is to add .coalesce(1) right after the join, forcing all the data into a single partition before the sort:
result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .coalesce(1)  # single partition, so the sort order survives the groupBy
    .orderBy(F.col("BOOK_ID").desc())
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)
This gives the right result, but at a cost: coalesce(1) removes all parallelism, so performance suffers on large datasets.
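An alternative worth sketching (my suggestion, not part of the workaround above) is to drop the global orderBy entirely and sort each collected array instead. sort_array compares structs field by field, so putting BOOK_ID first in the struct sorts each author's books by BOOK_ID while keeping full parallelism:

result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .groupBy("AUTHOR_ID", "NAME")
    .agg(
        F.sort_array(
            F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")),
            asc=False,  # descending: structs compare by BOOK_ID first
        ).alias("BOOK_LIST")
    )
)

Since the sort happens inside each group after collection, it doesn't depend on how the rows were partitioned or shuffled.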