Tags: union, azure-data-lake, pyspark

Read a list of CSV files from Data Lake and union them into a single PySpark DataFrame


I am trying to read a list of CSV files from Azure Data Lake one by one, and after some checking I want to union them all into a single DataFrame.

# List every file in the Data Lake input directory.
fileList = dbutils.fs.ls(file_input_path)

for i in fileList:
  try:
    file_path = i.path
    print(file_path)

  except Exception as e:
    raise Exception(str(e))

In this case, I want to read each CSV from file_path with a custom schema and union all of them into a single DataFrame.

So far I can only read one CSV at a time, as below. How can I read every CSV and union them all into one DataFrame?

df = spark.read.csv(file_path, header = True, schema=custom_schema)

How can I achieve this cleanly? Thanks.


Solution

  • I managed to read the files and union them as below.

    # List every file in the input directory.
    fileList = dbutils.fs.ls(file_input_path)

    # Start from an empty DataFrame with the custom schema, then append each file.
    output_df = spark.createDataFrame([], schema=custom_schema)

    for i in fileList:
      try:
        file_path = i.path
        df = spark.read.csv(file_path, header=True, schema=custom_schema)
        output_df = output_df.union(df)

      except Exception as e:
        raise Exception(str(e))
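
  • For reference, a more compact variant of the same idea (a minimal sketch, assuming dbutils, file_input_path and custom_schema are defined exactly as above) folds the per-file DataFrames together with functools.reduce instead of starting from an empty DataFrame:

    from functools import reduce
    from pyspark.sql import DataFrame

    # Collect the path of every file in the input directory.
    paths = [f.path for f in dbutils.fs.ls(file_input_path)]

    # Read each CSV with the custom schema, then fold them into one DataFrame.
    dfs = [spark.read.csv(p, header=True, schema=custom_schema) for p in paths]
    output_df = reduce(DataFrame.union, dfs)

    # Alternative when no per-file checks are needed: spark.read.csv accepts a
    # list of paths, so all files can be read in a single call.
    # output_df = spark.read.csv(paths, header=True, schema=custom_schema)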