
Text file sent to Spark worker looks empty or not found


I want to send a basic config file to every Spark worker. The config file is written for Python's configobj. I specify it when submitting the job.

$ ./bin/spark-submit --files .../config.cfg .../spark_str_hello.py

But when I try to read it, it turns out the file doesn't exist there. When I print config.sections (which should return a list of section names), an empty list is printed. Below is a basic word-count example. I also tried initializing the config on the workers with foreachRDD and got the same result. Is there a special way to send text files to Spark workers?

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from configobj import ConfigObj

# Expected to be the file shipped with --files, but it comes back empty
config = ConfigObj('config.cfg')

sc = SparkContext()
ssc = StreamingContext(sc, 1)

lines = ssc.socketTextStream('localhost', 9999)
words = lines.flatMap(lambda x: x.split(' '))
pairs = words.map(lambda x: (x, 1))
wordCount = pairs.reduceByKey(lambda x, y: x + y)

print(config.sections)  # prints [] instead of the config's sections

wordCount.pprint()
ssc.start()
ssc.awaitTermination()

Solution

  • You need to use SparkFiles.get("FILE") to get the local path of a file distributed via --files, instead of opening it by its original relative name; see the sketch below.
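Here is a minimal sketch adapting the question's script, assuming config.cfg was passed with --files exactly as in the original spark-submit command (the tag_with_section helper is only for illustration):

from pyspark import SparkContext, SparkFiles
from pyspark.streaming import StreamingContext
from configobj import ConfigObj

sc = SparkContext()
ssc = StreamingContext(sc, 1)

# SparkFiles.get() resolves the local path of a file distributed
# via --files (or sc.addFile()) on the node where it is called.
config = ConfigObj(SparkFiles.get('config.cfg'))
print(config.sections)

# If the config is needed inside a transformation, resolve the path
# inside the function, so each executor reads its own local copy.
def tag_with_section(line):
    cfg = ConfigObj(SparkFiles.get('config.cfg'))
    return (line, cfg.sections)

lines = ssc.socketTextStream('localhost', 9999)
lines.map(tag_with_section).pprint()

ssc.start()
ssc.awaitTermination()

The key difference from the original script is that the file is never opened as plain 'config.cfg': the path returned by SparkFiles.get points into the directory where Spark copied the distributed files, which is why the original ConfigObj call silently produced an empty config.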