I read and filter data, and I need to count how each filter operation affects the result. Is it possible to somehow mix in Spark accumulators while using the DataFrame/Dataset API?
Sample code:
sparkSession.read
  .format("org.apache.spark.sql.delta.sources.DeltaDataSource")
  .load(path)
  // use a Spark accumulator to count records that passed this filter
  .where(col("ds") >= dateFromInclusive and col("ds") < dateToExclusive)
  // same here
  .where(col("origin").isin(origins))
You can use `count_if` to count how many rows match each of several predicates (and get all the counts in a single pass), but it is an aggregate function: it cannot simultaneously filter the rows the way the `where` calls in your code example do.
Example from the SQL built-in function documentation:
> SELECT count_if(col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
2
> SELECT count_if(col IS NULL) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
1
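Since your pipeline uses the DataFrame API, you can get the same per-predicate counts there by wrapping `count_if` in `expr` inside a single `agg`. Below is a minimal, self-contained sketch (Spark 3.x); the toy schema, date literals, and origin values are made up to stand in for your Delta table:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CountIfSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("count_if sketch")
    .getOrCreate()
  import spark.implicits._

  // Toy stand-in for the Delta table in the question
  val df = Seq(
    ("2023-01-01", "web"),
    ("2023-01-15", "mobile"),
    ("2023-03-01", "web")
  ).toDF("ds", "origin")

  // One pass over the data: each count_if tallies rows matching one predicate
  val counts = df.agg(
    expr("count_if(ds >= '2023-01-01' and ds < '2023-02-01')").as("in_date_range"),
    expr("count_if(origin in ('web'))").as("matching_origin")
  )

  counts.show()
  spark.stop()
}
```

Note that this computes the counts with a separate aggregation job; if you also need the filtered rows, the `where` chain from your example still has to run as its own query.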