Tags: python, apache-spark, pickle, dill

Using Python LIME as a UDF on Spark


I'm looking to use LIME's explainer within a UDF on PySpark. I've previously trained the tabular explainer and stored it as a dill file, as suggested in this link:

import dill
import numpy as np

# Load the previously trained tabular explainer from disk
loaded_explainer = dill.load(open('location_to_explainer', 'rb'))

def lime_explainer(*cols):
    # Collect the feature values for one row into an array
    selected_cols = np.array([value for value in cols])
    # loaded_model is the trained classifier, loaded elsewhere
    exp = loaded_explainer.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    mapping = exp.as_map()[1]

    return str(mapping)
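
For context, a minimal sketch of how a UDF like this might be registered and applied. The DataFrame `df` and the feature column names here are placeholders, not from the original question:

spark_lime_udf = F.udf(lime_explainer, StringType())  # requires: from pyspark.sql import functions as F; from pyspark.sql.types import StringType

feature_cols = ['feature_1', 'feature_2', 'feature_3']  # hypothetical columns
df = df.withColumn('lime_explanation', spark_lime_udf(*[F.col(c) for c in feature_cols]))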

This, however, takes a lot of time, as much of the computation appears to happen on the driver. I then tried using a Spark broadcast variable to ship the explainer to the executors.

broadcasted_explainer = sc.broadcast(loaded_explainer)

def lime_explainer(*cols):
    selected_cols = np.array([value for value in cols])
    # Access the broadcast explainer on the executor via .value
    exp = broadcasted_explainer.value.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    mapping = exp.as_map()[1]

    return str(mapping)

However, I run into a pickling error on broadcast:

PicklingError: Can't pickle <function <lambda> at 0x7f69fd5680d0>: attribute lookup <lambda> on lime.discretize failed

Can anybody help with this? Is there something like dill that we can use instead of the cloudpickle serializer used in Spark?


Solution

  • Looking at this source, it seems you have no choice but to use the pickler Spark provides. As such, I can only suggest you nest dill inside the default pickler. Not ideal, but it could work. Try something like:

    broadcasted_explainer = dill.loads(sc.broadcast(dill.dumps(loaded_explainer)).value)
    
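
    Spelled out, the idea is to broadcast the raw dill bytes, which the default pickler handles without trouble, and only call dill.loads inside the UDF so deserialization happens on the executors. A sketch of that variation, with loaded_model and the column handling taken from the question:

    import dill
    import numpy as np

    # Broadcast the serialized bytes; pickling a bytes object is always safe
    explainer_bytes = sc.broadcast(dill.dumps(loaded_explainer))

    def lime_explainer(*cols):
        # Deserialize with dill on the executor, bypassing the default pickler
        explainer = dill.loads(explainer_bytes.value)
        selected_cols = np.array([value for value in cols])
        exp = explainer.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
        return str(exp.as_map()[1])

    Note that as written, dill.loads runs once per row; in practice you would cache the deserialized explainer (e.g. in a module-level variable) so each executor pays the cost only once.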

    Or you might try calling dill's extend() method, which is supposed to add dill's data types to the default pickle dispatch. No idea if that will work, but you can try it!
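
    If you want to try that route, it is a one-liner, assuming your installed dill version ships extend():

    import dill

    # Register dill's types with the stdlib pickle dispatch table;
    # pass use_dill=False to revert to the default behaviour
    dill.extend(use_dill=True)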