I read and filter data, and I need to count how each filter operation affects the result. Is it possible to somehow mix in Spark accumulators while using the DataFrame/Dataset API?
Sample code:
sparkSession.read
  .format("org.apache.spark.sql.delta.sources.DeltaDataSource")
  .load(path)
  // use a Spark accumulator to count records that passed this filter
  .where(col("ds") >= dateFromInclusive and col("ds") < dateToExclusive)
  // same here
  .where(col("origin").isin(origins))
You can use `count_if` to count how many rows match each of several predicates (and get all the counts in a single pass), but it is an aggregate function: it cannot simultaneously filter the rows the way the `where` calls in your code example do.
Example from the SQL built-in function documentation:
> SELECT count_if(col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
2
> SELECT count_if(col IS NULL) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
1
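Since your pipeline uses the DataFrame API, you can get the same per-predicate counts there by wrapping `count_if` in `expr` inside a single `agg`. Below is a minimal, self-contained sketch (Spark 3.x); the toy schema, date literals, and origin values are made up to stand in for your Delta table:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CountIfSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("count_if sketch")
    .getOrCreate()
  import spark.implicits._

  // Toy stand-in for the Delta table in the question
  val df = Seq(
    ("2023-01-01", "web"),
    ("2023-01-15", "mobile"),
    ("2023-03-01", "web")
  ).toDF("ds", "origin")

  // One pass over the data: each count_if tallies rows matching one predicate
  val counts = df.agg(
    expr("count_if(ds >= '2023-01-01' and ds < '2023-02-01')").as("in_date_range"),
    expr("count_if(origin in ('web'))").as("matching_origin")
  )

  counts.show()
  spark.stop()
}
```

Note that this computes the counts with a separate aggregation job; if you also need the filtered rows, the `where` chain from your example still has to run as its own query.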