Tags: azure, pyspark, azure-data-factory, azure-databricks

Are there alternative functions in PySpark for countAll, countAllDistinct in Azure Data Flow?


In the Azure Data Flow Aggregate transformation I have count (counts values, excluding nulls) and countAll (counts values, including nulls), and similarly countDistinct and countAllDistinct.

But in PySpark there are only count() and countDistinct(). So how can I achieve countAll() and countAllDistinct() in PySpark?


Solution

  • As you know, there are no built-in functions for countAll and countAllDistinct in PySpark; to work around that you can do the following:

    Note that count("column") skips null values, so to count all rows in a PySpark DataFrame including nulls you either count every row directly (for example df.count()) or build the count from a conditional expression, as shown below.

    If you have a DataFrame in PySpark:

    from pyspark.sql.functions import count, col, sum, countDistinct, when

    # count (nulls excluded): count() skips null values in the column
    df.select(count("column name")).show()

    # countAll (nulls included): flag every row with 1 and sum the flags
    df.select(sum(when(col("column name").isNull() | col("column name").isNotNull(), 1).otherwise(0))).show()

    # countDistinct (nulls excluded)
    df.select(countDistinct("column name")).show()

    # countAllDistinct (nulls included): distinct values of the column, with null counted as a value
    df.select("column name").distinct().count()
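    If it helps to see the numbers on concrete data, here is a minimal sketch (the sample rows and the column name "value" are made up for illustration). count(lit(1)) counts every row regardless of nulls, so it behaves like countAll, and distinct().count() on the column gives the countAllDistinct equivalent:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import count, countDistinct, lit

    spark = SparkSession.builder.getOrCreate()

    # hypothetical sample data: one null and one duplicate in "value"
    df = spark.createDataFrame([(1,), (2,), (2,), (None,)], ["value"])

    df.select(
        count("value").alias("count"),                  # 3 -> nulls excluded
        count(lit(1)).alias("countAll"),                # 4 -> every row counted
        countDistinct("value").alias("countDistinct"),  # 2 -> distinct non-null values
    ).show()

    # countAllDistinct: distinct values of the column, null counted as a value
    print(df.select("value").distinct().count())        # 3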


    If the data is available as a table or view, you can use Spark SQL instead:

    -- count all rows including null
    SELECT COUNT(*) AS total_count FROM (SELECT value FROM sampleView);

    -- count all distinct rows including null
    SELECT COUNT(*) AS distinct_total_count FROM (SELECT DISTINCT value FROM sampleView);
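    To run the Spark SQL version from PySpark, the DataFrame first has to be exposed as a view; a minimal sketch, reusing the hypothetical df from the sketch above and the sampleView name from the queries:

    # register the DataFrame as a temporary view so Spark SQL can query it
    df.createOrReplaceTempView("sampleView")

    # countAll equivalent: COUNT(*) counts every row, nulls included
    spark.sql("SELECT COUNT(*) AS total_count FROM sampleView").show()

    # countAllDistinct equivalent: distinct values first, then count the rows
    spark.sql(
        "SELECT COUNT(*) AS distinct_total_count "
        "FROM (SELECT DISTINCT value FROM sampleView) AS d"
    ).show()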
