Spark is able to read the Parquet files that underlie a Hive table.
I would like to modify how this works for a bespoke use case. From what I have read, I need to provide an implementation of TableProvider (and a few other interfaces). I do not want to write this implementation from scratch; I simply want to tweak the way Spark currently reads from Hive tables and from Parquet files. However, I am having difficulty working out where the current implementation of TableProvider lives, either for Hive tables or for Parquet directly. Could someone point me to it?
Further, I cannot find any clear description of how to make a provider available and how to use it within PySpark. (I am happy to write the TableProvider in either Java or Python, though everything I see suggests it has to be Java.)
For reference, I am using AWS EMR to provide Spark.
Also for reference, the parts I will initially tweak relate to which files are read rather than how they are read.
I am hoping for the fully qualified class names of the classes I need to copy, and for guidance on how to tell Spark to use my modified copies.
ParquetDataSourceV2, which extends FileDataSourceV2, is registered via the Java SPI file META-INF/services/org.apache.spark.sql.sources.DataSourceRegister. There are several more classes defined under org/apache/spark/sql/execution/datasources/parquet, such as ParquetReadSupport and ParquetWriteSupport, that you may need on your journey.
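For illustration, here is a minimal sketch of the kind of subclass you could register yourself, assuming Spark 3.x. Note that ParquetDataSourceV2 sits in an internal package (org.apache.spark.sql.execution.datasources.v2.parquet in recent releases) and can change between Spark versions, and the class name TweakedParquetDataSourceV2 and short name "tweaked-parquet" below are placeholders of my own:

```scala
package com.example.datasources

// Internal Spark class; the package location may differ in your Spark version.
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2

// A thin wrapper around the built-in Parquet V2 source, exposed under its own
// short name so it does not shadow the stock "parquet" format.
class TweakedParquetDataSourceV2 extends ParquetDataSourceV2 {
  override def shortName(): String = "tweaked-parquet"

  // To change *which* files are selected rather than how they are read, this is
  // roughly where you would override getTable(...) and adjust the paths/options
  // before delegating to the parent Parquet implementation.
}
```

To make it visible, list the fully qualified class name (here, com.example.datasources.TweakedParquetDataSourceV2) in your jar's META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, and ship the jar to the cluster (for example via spark.jars or an EMR configuration). The provider itself has to live on the JVM; PySpark only refers to it by name, e.g. spark.read.format("tweaked-parquet").load("s3://bucket/path").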