Tags: python-3.x, pyspark

Is there an efficient way in PySpark to find the array element with the highest value of one field but return another field?


I'd like to do something similar to array_max(), but I need to return a field other than the one being used to sort.

Here's the data that's in a column, let's call it df.idInfo:

[
    {
        "accountId": 123,
        "lastUpdated": {
            "sourceSystemTimestamp": "2023-01-01 07:44:58.000",
            "userId": null
        },
        "individualNumber": 33333
    },
    {
        "accountId": 456,
        "lastUpdated": {
            "sourceSystemTimestamp": "2023-02-12 18:44:58.000",
            "userId": null
        },
        "individualNumber": 22222
    },
    {
        "accountId": 789,
        "lastUpdated": {
            "sourceSystemTimestamp": "2023-03-23 23:44:58.000",
            "userId": null
        },
        "individualNumber": 11111
    }
]

I want to get the individualNumber from the element that has the latest value of lastUpdated.sourceSystemTimestamp.

Is there any efficient way to do this? Any built-in Python or PySpark functions I could leverage?


Solution

  • You can use array_max to get the maximum sourceSystemTimestamp, then filter the same array on that value to fetch the corresponding individualNumber.

    # Requires: from pyspark.sql import functions as F
    df = (df.withColumn('max_time',
                        F.array_max(
                            F.transform('idInfo', lambda x: x['lastUpdated']['sourceSystemTimestamp'])
                        ))
          .withColumn('max_number',
                      F.filter('idInfo',
                               lambda x: x['lastUpdated']['sourceSystemTimestamp'] == F.col('max_time')
                               )[0]['individualNumber']))