Tags: python, apache-spark, pyspark, apache-spark-sql, databricks

How to query for the maximum / highest value in a field with PySpark


The following code produces a DataFrame whose version column holds the values 0 to 3.

from delta.tables import DeltaTable
from pyspark.sql.functions import col

df = DeltaTable.forPath(spark, '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1').history().select(col("version"))


Can someone show me how to query the DataFrame so that it returns only the maximum value, i.e. 3?

I have tried

df.select("*").max("version")

And

df.max("version")

But no luck

Any thoughts?


Solution

  • Use the max function from pyspark.sql.functions. A DataFrame itself has no max method (that method exists on GroupedData, the result of groupBy), which is why both of your attempts fail. This should work:

    from pyspark.sql import functions as F

    df.select(F.max("version").alias("max_version")).show()
    

    or

    df.agg(F.max("version").alias("max_version")).show()
    

    Input:

    +-------+
    |version|
    +-------+
    |      0|
    |      1|
    |      3|
    |      2|
    +-------+
    

    Output:

    +-----------+
    |max_version|
    +-----------+
    |          3|
    +-----------+
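
    If you need the maximum as a plain Python value rather than a one-row DataFrame (for example, to feed the version number into a later query), a minimal sketch, assuming df is the history DataFrame from the question:

    from pyspark.sql import functions as F

    # first() returns the single Row of the aggregate; [0] pulls out its value.
    # Note this triggers a Spark job, since the aggregate has to be computed.
    max_version = df.agg(F.max("version")).first()[0]
    print(max_version)  # 3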