Using Spark with Scala on Azure Databricks.
I have a DataFrame (df1) with 100+ columns. I need to create another DataFrame (df2) with the same schema. How can I store the schema of df1 in a variable and apply it to df2?
val inputDF = Seq(("00163E0F765C1ED79593228BF70CEE41" ,"PD PUMPS")
,("00164E0F775C1ED79593228BF70CEE42" ,"PD PUMPS")
,("00165E0F785C1ED79593228BF70CEE43" ,"PD PUMPS")
,("00166E0F795C1ED79593228BF70CEE44" ,"PD PUMPS")
,("00167E0F405C1ED79593228BF70CEE45" ,"PD PUMPS")
).toDF("objectID")
val expectedDF = Seq(("00163E0F765C1ED79593228BF70CEE41" ,"PD PUMPS1")
,("00164E0F775C1ED79593228BF70CEE42" ,"PD PUMPS1")
,("00165E0F785C1ED79593228BF70CEE43" ,"PD PUMPS1")
,("00166E0F795C1ED79593228BF70CEE44" ,"PD PUMPS1")
,("00167E0F405C1ED79593228BF70CEE45" ,"PD PUMPS1")
).toDF("objectID","equipmentName", inputDF.schema)
The purpose of doing this: I am writing a unit test case. I have a function which adds a column to the DataFrame passed in as a parameter. So I need to create inputDF, and then create expectedDF with one more column than inputDF.
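To answer the core question first: a DataFrame's schema is just a StructType value, so it can be stored in a variable and handed to spark.createDataFrame. A minimal sketch (the column names and the local SparkSession setup here are illustrative; on Databricks the spark session already exists):

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Illustrative local session; on Databricks `spark` is provided
val spark = SparkSession.builder.master("local[1]").appName("schema-copy").getOrCreate()
import spark.implicits._

val df1 = Seq(("A1", "PD PUMPS"), ("A2", "PD PUMPS")).toDF("objectID", "equipmentName")

// Store the schema of df1 in a variable...
val schema = df1.schema

// ...and apply it to new rows to build df2 with the identical schema
val rows = spark.sparkContext.parallelize(Seq(Row("B1", "PD PUMPS"), Row("B2", "PD PUMPS")))
val df2 = spark.createDataFrame(rows, schema)
```

df2.schema == df1.schema will then hold, column names and types included.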
Given that for the inputDF generation you should use a sequence with just one column (or pass two strings to the toDF method), I would do as follows:
val inputDF = Seq(("00163E0F765C1ED79593228BF70CEE41")
,("00164E0F775C1ED79593228BF70CEE42")
,("00165E0F785C1ED79593228BF70CEE43")
,("00166E0F795C1ED79593228BF70CEE44")
,("00167E0F405C1ED79593228BF70CEE45")
).toDF("objectID")
val seq = Seq(("00163E0F765C1ED79593228BF70CEE41" ,"PD PUMPS1")
,("00164E0F775C1ED79593228BF70CEE42" ,"PD PUMPS1")
,("00165E0F785C1ED79593228BF70CEE43" ,"PD PUMPS1")
,("00166E0F795C1ED79593228BF70CEE44" ,"PD PUMPS1")
,("00167E0F405C1ED79593228BF70CEE45" ,"PD PUMPS1")
)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}

val rdd = spark.sparkContext.parallelize(seq)
// Convert each tuple into a generic Row so createDataFrame can apply an explicit schema
val rows: RDD[Row] = rdd.map((row: (String, String)) => {
  Row(row.productIterator.toList: _*)
})
val expectedDF = spark.createDataFrame(rows,
  inputDF.schema.add(StructField("NewColumn", StringType)))
expectedDF.show()
The idea is to create a DataFrame from the sequence with createDataFrame, passing it the schema (a StructType) of the old DataFrame with one extra entry (a StructField) appended via add.
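As an aside, when the new column holds a constant value as in this example, expectedDF can also be produced directly with withColumn and lit, avoiding the RDD[Row] round-trip entirely. A sketch under the same column names (the local SparkSession is illustrative; note that lit may yield a non-nullable field, so the resulting schema can differ in nullability from the schema.add version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Illustrative local session; on Databricks `spark` is provided
val spark = SparkSession.builder.master("local[1]").appName("with-column").getOrCreate()
import spark.implicits._

val inputDF = Seq(
  "00163E0F765C1ED79593228BF70CEE41",
  "00164E0F775C1ED79593228BF70CEE42"
).toDF("objectID")

// Append a constant StringType column; the schema is inputDF's plus one field
val expectedDF = inputDF.withColumn("NewColumn", lit("PD PUMPS1"))
```

This keeps the test setup short when exact nullability flags are not part of what the unit test asserts.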