apache-spark, amazon-s3, databricks, spark-streaming, delta-lake

Problem reading files in Delta Lakehouse - data streaming


I'm building my first Delta Lakehouse in Databricks, on my own. I need to read files in CSV format from a bucket in AWS. I can view the folder with my files by running display(dbutils.fs.ls(...)). However, when doing readStream, it reads only a single file, even though all the files have the same name, differing only in the last 4 digits, which are years (2013, 2014, 2015).

Example: data_from_2013.csv, data_from_2014.csv.

And the only file that gets read, which has exactly the same structure as all the other files, is from 2018. I believe I am doing something wrong, because I need to insert a new column for the year into each file, and the only place that holds this value is column B, line 1, of each file. After that, I need to drop the first three lines of each file.

path_file = "/mnt/path/path_data/path_data_year/data_from_****.csv"

df_bronze_despesUF_funcao = spark.readStream \
    .format("csv") \
    .schema(schema) \
    .options(header='true', inferSchema='true', 
             delimiter=';', encoding='iso-8859-1', skiprows=3) \
    .load(path_file)

Please, can anyone help me?

I've read the Databricks documentation and watched videos, but I can't complete this task.


Solution

  • Instead of declaring path_file with the _****.csv glob, just declare path_file as the folder. Spark Structured Streaming's file source monitors a directory, so pointing the stream at the folder lets it pick up every file inside:

    path_file = "/mnt/path/path_data/path_data_year/"

    df_bronze_despesUF_funcao = spark.readStream \
        .format("csv") \
        .schema(schema) \
        .options(header='true', inferSchema='true',
                 delimiter=';', encoding='iso-8859-1', skiprows=3) \
        .load(path_file)
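
  • Not part of the original answer, but two caveats remain: skiprows is a pandas option, not a Spark CSV option, so it has no effect here and the unwanted leading lines would still have to be filtered out after the read; and the year column the question asks about can be taken from the file name instead of from column B, line 1. A minimal sketch, assuming the files keep the data_from_YYYY.csv naming shown in the question:

    from pyspark.sql.functions import input_file_name, regexp_extract

    path_file = "/mnt/path/path_data/path_data_year/"

    df_bronze_despesUF_funcao = (
        spark.readStream
        .format("csv")
        # an explicit schema is required for streaming file sources,
        # so inferSchema is dropped
        .schema(schema)
        .options(header='true', delimiter=';', encoding='iso-8859-1')
        .load(path_file)
        # hypothetical "year" column: extract the 4 digits from the
        # file name, e.g. ".../data_from_2013.csv" -> "2013"
        .withColumn("year",
                    regexp_extract(input_file_name(),
                                   r"data_from_(\d{4})\.csv", 1))
    )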