apache-spark | pyspark | rdd

How to pass multiple arguments when mapping and filtering RDD?


I currently have this line to filter and apply a function to an RDD.

data_to_update.rdd.map(find_differences).filter(lambda row: bool(row))

I want to modify the find_differences function to also take another argument unique_id in addition to row. I'm not exactly sure how to go about modifying this line to do that, or if there's a better way to write it.


Solution

  • Assuming that your current function looks something like this:

    def find_differences(row):
        # do something
        return result
    

    You can define a new function that takes the extra argument, then use `functools.partial` to bind `unique_id`, producing a single-argument function that matches your original signature:

    from functools import partial 
    
    def find_differences_id(unique_id, row):
        # do something else
        return another_result
    
    find_differences = partial(find_differences_id, unique_id)
    

    Since the partial function takes only `row`, you can map the RDD exactly as you did before.
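
    As a self-contained sketch (no Spark cluster needed), the same pattern can be demonstrated with plain `map`/`filter`, which mirror `rdd.map(...).filter(...)`. The body of `find_differences_id` here is hypothetical, just to make the example runnable:

    ```python
    from functools import partial

    # Hypothetical stand-in: return the row when it differs from the
    # given unique_id, otherwise return None (falsy, filtered out below).
    def find_differences_id(unique_id, row):
        return row if row.get("id") != unique_id else None

    # Bind unique_id so the resulting callable takes a single `row`
    # argument, matching what rdd.map() expects.
    find_differences = partial(find_differences_id, 42)

    rows = [{"id": 42}, {"id": 7}, {"id": 99}]

    # Equivalent in spirit to:
    #   data_to_update.rdd.map(find_differences).filter(lambda row: bool(row))
    results = list(filter(bool, map(find_differences, rows)))
    print(results)  # [{'id': 7}, {'id': 99}]
    ```

    An equally common alternative is an inline lambda, `rdd.map(lambda row: find_differences_id(unique_id, row))`, which avoids `partial` at the cost of a little readability.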