Tags: amazon-web-services, amazon-s3, etl, aws-glue, aws-glue-spark

How do I save a machine learning model (KMeans) to S3 from a Glue ETL job written in PySpark?


I tried model.save(sc, path), but it gives me this error: TypeError: save() takes 2 positional arguments but 3 were given. Here sc is the SparkContext [sc = SparkContext()]. I then tried without sc in the call, but got: An error occurred while calling o159.save. java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found. I have also tried multiple approaches using boto3, pickle, and joblib, but I haven't found a solution that works.

I am creating a KMeans clustering model. I need one Glue job to fit the model and save it to S3, and another Glue job to load the saved model and make predictions. I am doing this for the first time, so any help would be appreciated.
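
For context, the two Spark KMeans APIs have different save signatures, which likely explains the TypeError above. A minimal sketch for comparison (the S3 paths and the training_df/training_rdd inputs are placeholders):

    # pyspark.ml (DataFrame API): save() takes only a path
    from pyspark.ml.clustering import KMeans
    model = KMeans(k=3, seed=1).fit(training_df)          # training_df needs a "features" vector column
    model.save("s3://your-bucket/models/kmeans")          # no SparkContext argument

    # pyspark.mllib (RDD API): save() takes the SparkContext and a path
    from pyspark.mllib.clustering import KMeans as MLlibKMeans
    mllib_model = MLlibKMeans.train(training_rdd, k=3)    # training_rdd is an RDD of feature vectors
    mllib_model.save(sc, "s3://your-bucket/models/kmeans-mllib")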


Solution

  • Adding the following line right after creating the SparkContext solved my problem:

    sc = SparkContext()

    # tell Hadoop to use the direct output committer so the model can be written straight to S3
    sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.DirectFileOutputCommitter")
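
    With that configuration in place, a minimal sketch of the two Glue jobs might look like this. The bucket paths, k value, and input data are placeholders, and the training DataFrame is assumed to already contain a "features" vector column (e.g. built with VectorAssembler):

    # Job 1: fit the model and save it to S3
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from pyspark.ml.clustering import KMeans

    sc = SparkContext()
    sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.DirectFileOutputCommitter")
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    training_df = spark.read.parquet("s3://your-bucket/training-data/")  # placeholder input
    model = KMeans(k=3, seed=1).fit(training_df)
    model.save("s3://your-bucket/models/kmeans")

    # Job 2: load the saved model and score new data
    from pyspark.ml.clustering import KMeansModel

    model = KMeansModel.load("s3://your-bucket/models/kmeans")
    predictions = model.transform(new_data_df)  # new_data_df is a placeholder; transform adds a "prediction" column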