Tags: azure, pyspark, azure-data-factory, azure-databricks

Are there alternative functions in PySpark for countAll, countAllDistinct in Azure Data Flow?


In the Azure Data Flow Aggregate transformation I have count (counts values, excluding nulls) and countAll (counts values, including nulls), and similarly countDistinct and countAllDistinct.

But in PySpark there are only count() and countDistinct(). So how can I achieve countAll() and countAllDistinct() in PySpark?


Solution

  • As you know, there are no built-in functions for countAll and countAllDistinct in PySpark; to work around that you can do the following:

    Note that count("column") skips null values, so to count all rows in a PySpark DataFrame including nulls you either count every row directly (for example df.count()) or build the count from a conditional expression, as shown below.

    If you have a DataFrame in PySpark:

    from pyspark.sql.functions import count, col, sum, countDistinct, when

    # count (nulls excluded): count() skips null values in the column
    df.select(count("column name")).show()

    # countAll (nulls included): flag every row with 1 and sum the flags
    df.select(sum(when(col("column name").isNull() | col("column name").isNotNull(), 1).otherwise(0))).show()

    # countDistinct (nulls excluded)
    df.select(countDistinct("column name")).show()

    # countAllDistinct (nulls included): distinct values of the column, with null counted as a value
    df.select("column name").distinct().count()
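    If it helps to see the numbers on concrete data, here is a minimal sketch (the sample rows and the column name "value" are made up for illustration). count(lit(1)) counts every row regardless of nulls, so it behaves like countAll, and distinct().count() on the column gives the countAllDistinct equivalent:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import count, countDistinct, lit

    spark = SparkSession.builder.getOrCreate()

    # hypothetical sample data: one null and one duplicate in "value"
    df = spark.createDataFrame([(1,), (2,), (2,), (None,)], ["value"])

    df.select(
        count("value").alias("count"),                  # 3 -> nulls excluded
        count(lit(1)).alias("countAll"),                # 4 -> every row counted
        countDistinct("value").alias("countDistinct"),  # 2 -> distinct non-null values
    ).show()

    # countAllDistinct: distinct values of the column, null counted as a value
    print(df.select("value").distinct().count())        # 3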


    If the data is available as a table or view, you can use Spark SQL instead:

    -- count all rows including null
    SELECT COUNT(*) AS total_count FROM (SELECT value FROM sampleView);

    -- count all distinct rows including null
    SELECT COUNT(*) AS distinct_total_count FROM (SELECT DISTINCT value FROM sampleView);
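    To run the Spark SQL version from PySpark, the DataFrame first has to be exposed as a view; a minimal sketch, reusing the hypothetical df from the sketch above and the sampleView name from the queries:

    # register the DataFrame as a temporary view so Spark SQL can query it
    df.createOrReplaceTempView("sampleView")

    # countAll equivalent: COUNT(*) counts every row, nulls included
    spark.sql("SELECT COUNT(*) AS total_count FROM sampleView").show()

    # countAllDistinct equivalent: distinct values first, then count the rows
    spark.sql(
        "SELECT COUNT(*) AS distinct_total_count "
        "FROM (SELECT DISTINCT value FROM sampleView) AS d"
    ).show()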
