In the Azure Data Factory Data Flow Aggregate transformation I have count (null exclusive) and countAll (null inclusive), and similarly countDistinct and countAllDistinct.
But in PySpark there are only count() and countDistinct(). So how do I achieve countAll() and countAllDistinct() in PySpark?
As you know, there are no built-in PySpark functions for countAll and countAllDistinct. To work around this, you can do the following:
The aggregate function count("column") skips nulls in that column, so to count all rows including null values you need either count("*") (or df.count()), which counts rows rather than column values, or an explicit when()/sum() expression like the one below.
If you have a DataFrame in PySpark:
from pyspark.sql.functions import count, col, sum, countDistinct, when

# count: all rows excluding nulls
df.select(count("column_name")).show()

# countAll: all rows including nulls (the condition holds for every row, so this is equivalent to count("*"))
df.select(sum(when(col("column_name").isNull() | col("column_name").isNotNull(), 1).otherwise(0))).show()

# countDistinct: distinct values excluding nulls
df.select(countDistinct("column_name")).show()

# countAllDistinct: distinct values including nulls (returns a Python int, not a DataFrame)
print(df.select("column_name").distinct().count())
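For example, here is a minimal, self-contained sketch with made-up sample data (a duplicate 2 and a null among four rows) showing how the four counts differ:

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, col, sum, countDistinct, when

spark = SparkSession.builder.getOrCreate()

# four rows: 1, 2, 2, null
df = spark.createDataFrame([(1,), (2,), (2,), (None,)], ["value"])

df.select(count("value")).show()          # count            -> 3 (null excluded)
df.select(sum(when(col("value").isNull() | col("value").isNotNull(), 1).otherwise(0))).show()  # countAll -> 4
df.select(countDistinct("value")).show()  # countDistinct    -> 2 (1 and 2)
print(df.select("value").distinct().count())  # countAllDistinct -> 3 (1, 2 and null)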
The same can be done in Spark SQL, with the data exposed as a view named sampleView:
-- count all rows including nulls
SELECT COUNT(*) AS total_count FROM (SELECT value FROM sampleView);

-- count all distinct rows including nulls
SELECT COUNT(*) AS distinct_total_count FROM (SELECT DISTINCT value FROM sampleView);
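For completeness, the null-exclusive variants need no workaround, since COUNT(value) and COUNT(DISTINCT value) already skip nulls (the column aliases here are just illustrative):

-- count all rows excluding nulls
SELECT COUNT(value) AS non_null_count FROM sampleView;

-- count all distinct rows excluding nulls
SELECT COUNT(DISTINCT value) AS distinct_count FROM sampleView;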
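To run these from PySpark, register the sample DataFrame from above as a temporary view (sampleView is just the name chosen for it) and pass each statement to spark.sql():

# expose the DataFrame to SQL under the name sampleView
df.createOrReplaceTempView("sampleView")

spark.sql("SELECT COUNT(*) AS total_count FROM (SELECT value FROM sampleView)").show()                    # total_count -> 4
spark.sql("SELECT COUNT(*) AS distinct_total_count FROM (SELECT DISTINCT value FROM sampleView)").show()  # distinct_total_count -> 3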