Tags: python, apache-spark, databricks, azure-databricks

How to read files in parallel in Databricks?


Could someone tell me how to read files in parallel? I'm trying something like this:

def processFile(path):
  df = spark.read.json(path)
  return df.count()

paths = ["...", "..."]

distPaths = sc.parallelize(paths)
counts = distPaths.map(processFile).collect()
print(counts)

It fails with the following error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Is there any other way to optimize this?


Solution

  • In your particular case, you can simply pass the whole paths list to DataFrameReader:

    df = spark.read.json(paths)
    

    ...and Spark will parallelize reading those files itself (see the sketch below).
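
    As a complete example, here is a minimal sketch of the corrected code. It assumes spark is the ambient SparkSession of a Databricks notebook; input_file_name() is a built-in PySpark function, used here only because the original processFile returned one count per path:

    from pyspark.sql.functions import input_file_name

    paths = ["...", "..."]

    # A single read over all paths; Spark distributes the file scans
    # across the executors for you.
    df = spark.read.json(paths)

    # Total row count across every file:
    print(df.count())

    # One count per source file, matching the original per-path counts:
    df.groupBy(input_file_name().alias("file")).count().show(truncate=False)

    This stays entirely on the driver API, so it avoids the SPARK-5063 error: spark is never referenced inside a transformation that runs on the workers.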