apache-spark, apache-spark-sql, apache-spark-connector

Spark DataSource V2 API


I have been trying to create a custom Spark data source using the V2 API.

As per the Jira ticket, it is in master and available for use. But when I tried to use it in my project, compilation failed, saying that no such package path exists in v3.3.1.

package closed.source.gs

import org.apache.spark.sql.sources.v2.DataSourceV2

class DefaultSource extends DataSourceV2 {
}

Error:

...../gs/DefaultSource.scala:3:37: object v2 is not a member of package org.apache.spark.sql.sources
[error] import org.apache.spark.sql.sources.v2.DataSourceV2

Any ideas on what happened to the V2 APIs? Did the Spark devs refactor this? Are there any good, up-to-date articles that could help me here?

Links I have already read, but which didn't help: https://blog.madhukaraphatak.com/spark-datasource-v2-part-3


Solution

  • Data Source API V2, introduced in Spark 2.3, was heavily refactored in Spark 3. Here's the API improvement proposal document if you'd like to learn more about the motivation behind it.

    So in Spark 2 you were extending the DataSourceV2 interface, but don't look for it - it no longer exists. Instead you implement a small set of interfaces (TableProvider, Table, ScanBuilder and so on) from the org.apache.spark.sql.connector packages; see the sketch below.

    Here's a blog post series about implementing a custom data source in Spark 3. It's worth reading through, but if you want, you can jump straight to part 3 for a concrete example with code.
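
    To give you a rough idea of the new shape, here is a minimal sketch of a read-only batch source against the Spark 3.3 connector interfaces. It keeps the package and DefaultSource name from your question; everything else (SimpleTable, SimpleScanBuilder, the hard-coded rows) is purely illustrative, not the one canonical way to do it.

    package closed.source.gs

    import java.util

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
    import org.apache.spark.sql.types.{StringType, StructType}
    import org.apache.spark.sql.util.CaseInsensitiveStringMap
    import org.apache.spark.unsafe.types.UTF8String

    // Entry point: Spark finds this class by appending ".DefaultSource" to the format name.
    class DefaultSource extends TableProvider {

      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType().add("value", StringType)

      override def getTable(schema: StructType,
                            partitioning: Array[Transform],
                            properties: util.Map[String, String]): Table =
        new SimpleTable(schema)
    }

    // Describes the table and its capabilities; SupportsRead contributes newScanBuilder.
    class SimpleTable(tableSchema: StructType) extends Table with SupportsRead {
      override def name(): String = "simple_table"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ)
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
        new SimpleScanBuilder(tableSchema)
    }

    // One class playing ScanBuilder, Scan and Batch keeps the sketch short.
    class SimpleScanBuilder(schema: StructType) extends ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = schema
      override def toBatch(): Batch = this
      override def planInputPartitions(): Array[InputPartition] = Array(new SimplePartition)
      override def createReaderFactory(): PartitionReaderFactory = new SimpleReaderFactory
    }

    class SimplePartition extends InputPartition

    class SimpleReaderFactory extends PartitionReaderFactory {
      override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
        new SimpleReader
    }

    // Emits two hard-coded rows, just to show where the real reading logic would go.
    class SimpleReader extends PartitionReader[InternalRow] {
      private val values = Iterator("hello", "world")
      private var current: String = _
      override def next(): Boolean =
        if (values.hasNext) { current = values.next(); true } else false
      override def get(): InternalRow = InternalRow(UTF8String.fromString(current))
      override def close(): Unit = ()
    }

    With that on the classpath, spark.read.format("closed.source.gs").load().show() should resolve DefaultSource the same way the V1/V2 sources did in Spark 2 and return the two placeholder rows, assuming the sketch above matches your Spark version.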