I am trying to use the Linux command-line tool 'Poppler' to extract information from PDF files. I want to do this for a huge number of PDFs on several Spark workers. I need to use Poppler, not PyPDF or anything similar.
Does anybody know how to install Poppler on the workers? I know that I can make command-line calls from within Python and fetch the output (or fetch the file generated by the Poppler library), but how do I install it on each worker? I'm using Spark 1.3.1 (Databricks).
Thank you!
The proper way is to install it on all your workers when you initially set them up, just as you would install any other Linux application. As you already pointed out, you can then shell out from within Python.
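As a minimal sketch of that approach, assuming the poppler-utils package (which provides pdftotext) is already installed on every worker and that the PDF paths used below are placeholders reachable from the workers:

```python
import subprocess

from pyspark import SparkContext

# On Databricks a SparkContext is usually already available as `sc`;
# creating one here only so the sketch is self-contained.
sc = SparkContext(appName="poppler-extract")

def extract_text(pdf_path):
    # Shell out to Poppler's pdftotext; "-" writes the extracted text to stdout.
    text = subprocess.check_output(["pdftotext", pdf_path, "-"])
    return pdf_path, text

# Placeholder paths; in practice these would come from your own listing.
pdf_paths = sc.parallelize(["/data/pdfs/a.pdf", "/data/pdfs/b.pdf"])
results = pdf_paths.map(extract_text).collect()
```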
If that is not an option for whatever reason, you can ship files to all workers using the addFile method: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.addFile
Note that the latter approach does not take care of dependencies (libraries etc.).
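For example, here is a sketch of the addFile route, assuming you have a pdftotext binary at a placeholder path and that it is either statically linked or its shared libraries already exist on the workers (per the caveat above). addFile ships the file to every worker and SparkFiles.get resolves its local path there:

```python
import os
import stat
import subprocess

from pyspark import SparkFiles

# Placeholder path to the binary on the driver.
sc.addFile("/path/to/pdftotext")

def extract_text(pdf_path):
    binary = SparkFiles.get("pdftotext")  # local path on this worker
    # Make sure the shipped file is executable before calling it.
    os.chmod(binary, os.stat(binary).st_mode | stat.S_IEXEC)
    return pdf_path, subprocess.check_output([binary, pdf_path, "-"])

results = sc.parallelize(["/data/pdfs/a.pdf"]).map(extract_text).collect()
```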