apache-spark · format · hbase

How to find all the options when read/write with Spark for a specific format?


Is there any way to find all the options when reading/writing with Spark for a specific format? I think they must be in the source code somewhere but I can't find it.

Below is my code that uses Spark to read data from HBase. It works fine, but I want to know where the options hbase.columns.mapping and hbase.table come from. Are there any other options?

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.spark.HBaseContext
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local").getOrCreate()
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", "vftsandbox-namenode,vftsandbox-snamenode,vftsandbox-node03")

  new HBaseContext(spark.sparkContext, hbaseConf)

  val hbaseTable = "mytable"
  val columnMapping =
    """id STRING :key,
      mycfColumn1 STRING mycf:column1,
      mycfColumn2 STRING mycf:column2,
      mycfCol1 STRING mycf:col1,
      mycfCol3 STRING mycf:col3
      """
  val hbaseSource = "org.apache.hadoop.hbase.spark"

  val hbaseDF = spark.read.format(hbaseSource)
    .option("hbase.columns.mapping", columnMapping)
    .option("hbase.table", hbaseTable)
    .load()
  hbaseDF.show()

I mean, if it's format("csv") or format("json") then there are docs on the internet listing all the options, but for this specific format (org.apache.hadoop.hbase.spark) I've had no luck. Even in the csv or json case, all the options documented on the internet must come from the code somewhere, right? They can't just be invented.
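For illustration, here is what that looks like for a built-in format. The keys below (header, inferSchema) are the standard documented csv options; the point is that they are ordinary strings that must be spelled out somewhere in Spark's own source:

```scala
// The documented csv options are plain string keys defined in Spark's
// own source code; there is nothing magic about them.
val csvDF = spark.read
  .format("csv")
  .option("header", "true")       // documented, but still just a string key
  .option("inferSchema", "true")
  .load("/path/to/data.csv")      // hypothetical path
```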

Now I think the real problem is "how to find all the Spark options in the source code in general". I tried using IntelliJ IDEA's search tool to search everywhere (even in the library sources), but no luck so far. I can't find anything related to hbase.columns.mapping or hbase.table at all (I already tried hbase_columns_mapping too), and there is nothing related in org.apache.hadoop.hbase.spark either; the only occurrences are in my own code.
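For context on why searching for the literal key string is the right approach: everything passed to .option() ends up in an untyped Map[String, String] that Spark hands to the format's relation provider, so option keys only exist as string literals (or constants) somewhere in the connector's source. A minimal, purely illustrative sketch of a data source (not the HBase connector's actual code):

```scala
// Illustrative sketch only: class and error message are made up,
// but the createRelation signature is the real DataSource V1 API.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

class DefaultSource extends RelationProvider {
  // Every .option(k, v) call on the reader ends up in this plain
  // Map[String, String], which is why grepping the connector's
  // sources for the key string (or its constant) is how you find it.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val table = parameters.getOrElse("hbase.table",
      sys.error("'hbase.table' must be specified"))
    ??? // build and return the relation here
  }
}
```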


I also found these lines in the console after running the code, but the HBaseRelation class IntelliJ opens is a decompiled stub whose method bodies are all ???:

17:53:51.205 [main] DEBUG org.apache.spark.util.ClosureCleaner -      HBaseRelation(Map(hbase.columns.mapping -> id STRING :key,
      mycfColumn1 STRING mycf:column1,
      mycfColumn2 STRING mycf:column2,
      mycfCol1 STRING mycf:col1,
      mycfCol3 STRING mycf:col3
      , hbase.table -> mytable),None)

I suspect these option names may only be resolvable at runtime rather than visible at compile time, but I'm not sure.


Solution

  • Because non-built-in formats are implemented in arbitrary third-party code, there is unfortunately no sure way of finding the options other than going through the available documentation and source code.

    For example, do the steps below to find the HBase Connector options.

    1. Search for the HBase Connector documentation/source code online.
    2. Notice that the documentation mentions the HBaseTableCatalog object; have a look at its definition.
    3. Notice that the repository's readme file and various code snippets online mention other options such as hbase.spark.pushdown.columnfilter; find out where they are defined in the repository. In this case they are defined in the HBaseSparkConf object.

    Also, please note that writing and reading operations may have different sets of options.
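    As a sketch of where the steps above lead, a read that uses an option discovered this way might look as follows. The pushdown key comes from the HBaseSparkConf object mentioned in step 3; verify it (and its default) against the connector version you actually use:

```scala
// Sketch only: in addition to the two options from the question, pass an
// option whose key is defined in HBaseSparkConf. Verify the key and its
// default against your connector version.
val hbaseDF = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "mytable")
  .option("hbase.columns.mapping", "id STRING :key, mycfColumn1 STRING mycf:column1")
  .option("hbase.spark.pushdown.columnfilter", "true")
  .load()
```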