
Order by on large number in PySpark


I have a PySpark dataframe with a string column that holds a large number (up to 40 digits). My goal is to sort by it. I tried casting to decimal, but that doesn't work when the number has more than 38 digits, which is the maximum precision of Spark's DecimalType. Here is an example dataframe to illustrate the issue.

from pyspark.sql import Row, SparkSession

# spark is usually predefined in a notebook or pyspark shell; create it otherwise.
spark = SparkSession.builder.getOrCreate()

# Column price holds a 40-digit number in the first row.
product_updates = [
    {'product_id': '00001', 'product_name': 'Heater', 'price': '1111111111111111111111111111111111111111', 'category': 'Electronics'}, 
    {'product_id': '00006', 'product_name': 'Chair', 'price': '50', 'category': 'Furniture'},
    {'product_id': '00007', 'product_name': 'Desk', 'price': '60', 'category': 'Furniture'}
]
df_product_updates = spark.createDataFrame(Row(**x) for x in product_updates)

# Order by price
df_product_updates.createOrReplaceTempView("sort_price")
df_sort_price = spark.sql("""
    select *,
           row_number() over (order by price DESC) rn
    from sort_price
""")

df_sort_price.show(truncate=False)

Is there a way to compare the numbers so that the largest one is ranked 1?

+----------+------------+----------------------------------------+-----------+---+
|product_id|product_name|price                                   |category   |rn |
+----------+------------+----------------------------------------+-----------+---+
|00007     |Desk        |60                                      |Furniture  |1  |
|00006     |Chair       |50                                      |Furniture  |2  |
|00001     |Heater      |1111111111111111111111111111111111111111|Electronics|3  |
+----------+------------+----------------------------------------+-----------+---+
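
For reference, this is the decimal cast I tried first (a minimal sketch against the same dataframe); decimal(38,0) is the widest decimal Spark supports, so the 40-digit value overflows and comes back as null (or raises an error if ANSI mode is enabled):

import pyspark.sql.functions as F

# The 40-digit price exceeds decimal(38,0) precision, so the cast
# yields null for that row, while '50' and '60' convert fine.
df_product_updates.select(
    "price",
    F.col("price").cast("decimal(38,0)").alias("price_dec")
).show(truncate=False)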

Thank you.


Solution

  • You could sort by your price column cast to a double, in descending order; a double's range (up to roughly 1.8 × 10^308) easily covers 40-digit values, even though decimal cannot:

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window
    
    window = Window.orderBy(F.col("price").astype("double").desc())
    df_product_updates.withColumn("rn", F.row_number().over(window)).show(truncate=False)
    
    +----------+------------+----------------------------------------+-----------+---+
    |product_id|product_name|price                                   |category   |rn |
    +----------+------------+----------------------------------------+-----------+---+
    |00001     |Heater      |1111111111111111111111111111111111111111|Electronics|1  |
    |00007     |Desk        |60                                      |Furniture  |2  |
    |00006     |Chair       |50                                      |Furniture  |3  |
    +----------+------------+----------------------------------------+-----------+---+
    

    Note that this only orders by the cast column; the final dataframe keeps the original schema, so your price column is still a StringType.
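
    A quick printSchema on the result confirms it (sketch):

    # price remains a string; only the sort key inside the window was cast.
    df_product_updates.withColumn("rn", F.row_number().over(window)).printSchema()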

    If you want to do this in SQL, you can:

    df_sort_price = spark.sql(
        """
        select *, row_number() over (order by cast(price as double) desc) rn
        from sort_price
        """
    )
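
    One caveat worth keeping in mind: a double carries only about 15-17 significant decimal digits, so two 40-digit strings that differ only in their lower digits will compare as equal after the cast and end up in an arbitrary relative order. If exact ordering matters, a precision-safe alternative (my sketch, assuming non-negative integer strings with no leading zeros, as in your example) is to order by string length first and then by the string itself:

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window

    # A longer digit string is always a larger number, and equal-length
    # digit strings order numerically when compared lexicographically.
    window = Window.orderBy(F.length("price").desc(), F.col("price").desc())
    df_product_updates.withColumn("rn", F.row_number().over(window)).show(truncate=False)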