Tags: scala, apache-spark

What is the difference between StructType and Row in spark?


I am new to Spark with Scala. Could someone help clarify my confusion below?

Question 1: When a Spark dataframe contains a struct column, a Spark UDF often takes input arguments of type Row or Seq[Row].

a. What is the difference between Row and StructType?

b. Why can't the Spark UDF take an input of type Seq[StructType]?

c. Seq is a Scala type, while Row is a Spark type. Why does the UDF mix these two types?

Question 2: When creating a dataframe, why does simpleData mix the Scala type Seq with the Spark type Row? Could it instead be Seq(StructType("James ","","Smith","36636","M",3000), ...)?

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val simpleData = Seq(Row("James ","","Smith","36636","M",3000),
    Row("Michael ","Rose","","40288","M",4000),
    Row("Robert ","","Williams","42114","M",4000),
    Row("Maria ","Anne","Jones","39192","F",4000),
    Row("Jen","Mary","Brown","","F",-1)
  )

val simpleSchema = StructType(Array(
    StructField("firstname",StringType,true),
    StructField("middlename",StringType,true),
    StructField("lastname",StringType,true),
    StructField("id", StringType, true),
    StructField("gender", StringType, true),
    StructField("salary", IntegerType, true)
  ))

  val df = spark.createDataFrame(
      spark.sparkContext.parallelize(simpleData),simpleSchema)
  df.printSchema()
  df.show()

Follow up:

In the Spark SQL data types reference (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), I see a "Data type" column and a "Value type in Scala" column. For structs, the "Data type" is StructType, while the "Value type in Scala" is org.apache.spark.sql.Row. What is the difference between the data type and the value type?


Solution

  • What is the difference between Row and StructType?

    StructType is a built-in DataType from org.apache.spark.sql.types that extends Seq[StructField].

    In simple words, it is a Seq[StructField] and is used to define the schema of a dataframe/dataset.
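    Because StructType extends Seq[StructField], the usual Seq operations work directly on a schema. A minimal sketch (field names are illustrative; it needs only the spark-sql jar on the classpath, not a running SparkSession):

    ```scala
    import org.apache.spark.sql.types._

    object SchemaAsSeq extends App {
      val schema = StructType(Array(
        StructField("firstname", StringType, true),
        StructField("salary", IntegerType, true)
      ))

      // Seq operations apply: map, filter, size, ...
      println(schema.map(_.name))                               // all field names
      println(schema.filter(_.dataType == IntegerType).map(_.name)) // only the integer fields

      // StructType also supports lookup of a field by name
      println(schema("salary").dataType)
    }
    ```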

    A Row, on the other hand, is a single value (one record) whose structure is described by a StructType: the schema describes the shape of the data, while Row objects hold the data itself.
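    This one-type/many-values split also answers the other sub-questions: at runtime a struct value arrives in a UDF as a Row, so UDF signatures use Row or Seq[Row] rather than StructType, and createDataFrame pairs a collection of Row values with a single StructType. A small sketch with hypothetical field values (constructing and reading a Row needs no SparkSession):

    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    object RowVsSchema extends App {
      // The schema (metadata): field names and types, holds no data
      val schema = StructType(Array(
        StructField("firstname", StringType, true),
        StructField("salary", IntegerType, true)
      ))

      // A Row (data): one record shaped like the schema above
      val row = Row("James", 3000)

      // Fields are accessed positionally; the schema tells you their types
      assert(row.getString(0) == "James")
      assert(row.getInt(1) == 3000)

      // createDataFrame takes many Row values but only one StructType:
      // many values, one data type.
      println(s"${row.length} fields in row, ${schema.length} fields in schema")
    }
    ```

    Seq in these signatures is just the ordinary Scala collection: Spark's Scala API freely mixes Scala collections of Spark values (Seq[Row]), which is why the two "kinds" of types appear together.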