Search code examples
apache-sparkhbasesparse-matrixgoogle-cloud-dataprocgoogle-cloud-bigtable

Spark HBase/BigTable - Wide/sparse dataframe persistence


I want to persist to BigTable a very wide Spark Dataframe (>100,000 columns) that is sparsely populated (>99% of values are null) while keeping only non-null values (to avoid storage cost).

Is there a way to specify in Spark to ignore nulls when writing?

Thanks !


Solution

  • Probably (didn't test it), before writing a Spark DataFrame to HBase/BigTable you can transform it by filtering out columns with null values in each row using custom function, as suggested here for an example using pandas : https://stackoverflow.com/a/59641595/3227693. However there is no built-in connector supporting this feature to my best knowledge.

    Alternatively, you can try store data in columnar file formats like Parquet instead, because they are efficiently handle persistence of sparse columnar data (at least in terms of output size in bytes). But to avoid writing many small files (due to sparse nature of the data) which can decrease write throughput, you probably will need to decrease number of output partitions before performing a write (i.e. write more rows per each parquet file: Spark parquet partitioning : Large number of files)