Consider a generic writeStream invocation, with the typical "console" output format:

out.writeStream
  .outputMode("complete")
  .format("console")
  .start()
What are the alternatives? I noticed that the default is actually parquet.

In DataStreamWriter:
/**
 * Specifies the underlying output data source.
 *
 * @since 2.0.0
 */
def format(source: String): DataStreamWriter[T] = {
  this.source = source
  this
}

private var source: String = df.sparkSession.sessionState.conf.defaultDataSourceName
In SQLConf:
def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME)

val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .stringConf
  .createWithDefault("parquet")
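So presumably the fallback format can be changed per session, e.g. (a minimal sketch, assuming an active SparkSession named spark):

// Sketch: overriding the session-wide default data source.
// Assumes an active SparkSession named `spark`.
spark.conf.set("spark.sql.sources.default", "orc")

// Any subsequent writeStream without an explicit .format(...)
// would now fall back to "orc" instead of "parquet".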
But then how is the path for the parquet output specified? What other formats are supported, and which options do they have or require?
Here is the official Spark documentation on output sinks: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
As of Spark 2.4.1, five sinks are supported out of the box:

- File sink
- Kafka sink
- Foreach sink
- Console sink
- Memory sink
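For the file sink (which is what the parquet default resolves to), the path is passed as an option, or directly to start. A minimal sketch, assuming a streaming DataFrame df and writable paths:

// Sketch of the file sink, assuming a streaming DataFrame `df`.
// The file sink requires a `checkpointLocation` and supports only
// "append" output mode.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/streaming-output")
  .option("checkpointLocation", "/tmp/streaming-checkpoint")
  .outputMode("append")
  .start()

// Equivalently, the path can be passed directly:
// .start("/tmp/streaming-output")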
On top of that, one can also implement a custom sink by extending Spark's Sink API: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
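For illustration, a bare-bones custom sink might look like the sketch below. Note that Sink lives in an internal package, so this is not a stable public API; the class names here (CountingSink, CountingSinkProvider) are hypothetical:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical sink that just reports the size of each micro-batch.
class CountingSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    println(s"Batch $batchId contained ${data.count()} rows")
  }
}

// Provider that Spark instantiates when .format(...) names this class.
class CountingSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new CountingSink
}

The provider is then wired in by passing its fully-qualified class name to format, e.g. df.writeStream.format("com.example.CountingSinkProvider").start().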