scala, apache-spark, delta-lake, apache-spark-dataset

Delta Lake Scala API for unit testing


I was able to make Delta Lake work locally for unit-testing my data + Spark app logic.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def readDeltaLake(path: String)(implicit sc: SparkSession): DataFrame =
      sc.read
        .format("org.apache.spark.sql.delta.sources.DeltaDataSource")
        .load(path)

    // local spark session
    implicit val sparkSession: SparkSession = aSparkSession()
    import sparkSession.implicits._

    // path to scala/test/resources with a parquet file
    io.delta.tables.DeltaTable.convertToDelta(sparkSession, s"parquet.`${singleInput.getParent.toFile.getAbsolutePath}`")

    val myTestData = readDeltaLake(singleInput.getParent.toFile.getAbsolutePath)
    myTestData.count() shouldBe 42L

The code above works fine, but I want to mimic a real Delta Lake layout with partitions. My partition layout looks like this:

hdfs://my_data/delta/ds=2024-05-27 23%3A00%3A00

How can I create the same thing, but with date partitions?


Solution

  • Per the comments and the linked documentation on writing to a Delta table:

    // spark.range only produces an "id" column; add a "ds" column to partition by
    val data = spark.range(5, 10).withColumn("ds", org.apache.spark.sql.functions.lit("2024-05-27 23:00:00"))
    data.write.partitionBy("ds").format("delta").mode("overwrite").save("/tmp/delta-table")
    spark.read.format("delta").load("/tmp/delta-table").show()
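
If you would rather keep the convertToDelta approach from the question (a parquet fixture under test resources), convertToDelta also has an overload that takes a partition schema string, which is needed when the underlying parquet directory is already partitioned. Below is a minimal sketch, not a definitive recipe: fixturePath is a hypothetical temp directory, the "ds" column and its value come from the partition path in the question, and aSparkSession() / readDeltaLake are the helpers the question already defines.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    // hypothetical fixture location; in a test this would point at a temp dir or test resources
    val spark: SparkSession = aSparkSession()
    val fixturePath = "/tmp/partitioned-fixture"

    // write a partitioned parquet fixture (ds=2024-05-27 23%3A00%3A00/... on disk)
    spark.range(5, 10)
      .withColumn("ds", lit("2024-05-27 23:00:00"))
      .write
      .partitionBy("ds")
      .parquet(fixturePath)

    // convertToDelta needs the partition schema when the parquet layout is partitioned
    io.delta.tables.DeltaTable.convertToDelta(spark, s"parquet.`$fixturePath`", "ds string")

    // the converted table can then be read back with readDeltaLake from the question
    val converted = readDeltaLake(fixturePath)(spark)
    converted.show()

Writing the fixture with partitionBy and then converting in place keeps the on-disk layout (including the URL-encoded ds=... directories) close to what the production HDFS path looks like, so the unit tests exercise the same partition pruning behaviour.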