
Is there a way to use the describe function in PySpark on more than one column?


I am trying to get summary statistics from a dataset in PySpark. When I combine the select function with the describe function to see the details of three columns, the result shows only the last column's information. I used a simple example from an article with this command:

my_data.select('Isball', 'Isboundary', 'Runs').describe().show()

It should show me the details of all three columns, but it only shows this:

+-------+------------------+
|summary|              Runs|
+-------+------------------+
|  count|               605|
|   mean|0.9917355371900827|
| stddev| 1.342725481259329|
|    min|                 0|
|    max|                 6|
+-------+------------------+

What should I do to get the results I am looking for?


Solution

  • The describe function works only on numeric and string columns as described in the documentation.

    I'm assuming Isball and Isboundary are boolean columns, which is why describe leaves them out of its output. You can cast those columns to integer so that they are included.

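    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    
    # getOrCreate() reuses the active session if one already exists
    spark = SparkSession.builder.getOrCreate()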
    
    df = spark.createDataFrame([
        (1, True, "lorem"),
        (2, False, "ipsum")
    ], ["integer_col", "bool_col", "string_col"])
    
    # bool_col is absent from this output because describe() skips boolean columns
    df.describe().show(truncate=0)
    
    +-------+------------------+----------+
    |summary|integer_col       |string_col|
    +-------+------------------+----------+
    |count  |2                 |2         |
    |mean   |1.5               |null      |
    |stddev |0.7071067811865476|null      |
    |min    |1                 |ipsum     |
    |max    |2                 |lorem     |
    +-------+------------------+----------+
    
    
    # Cast the boolean column to integer so describe() includes it
    df.withColumn("bool_col", col("bool_col").cast("integer")).describe().show(truncate=0)
    
    +-------+------------------+------------------+----------+
    |summary|integer_col       |bool_col          |string_col|
    +-------+------------------+------------------+----------+
    |count  |2                 |2                 |2         |
    |mean   |1.5               |0.5               |null      |
    |stddev |0.7071067811865476|0.7071067811865476|null      |
    |min    |1                 |0                 |ipsum     |
    |max    |2                 |1                 |lorem     |
    +-------+------------------+------------------+----------+
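
If there are several boolean columns, as with Isball and Isboundary in the question, the cast above can be applied to all of them in one pass. The following is a minimal sketch, not a built-in PySpark feature: `boolean_columns` and `describe_with_booleans` are hypothetical helper names, and the sketch assumes the columns really are of type boolean, which you can confirm with `my_data.printSchema()` or `my_data.dtypes`.

```python
def boolean_columns(dtypes):
    """Return the names of columns whose Spark type string is 'boolean'.

    `dtypes` is a list of (name, type) pairs, the shape returned by DataFrame.dtypes.
    """
    return [name for name, dtype in dtypes if dtype == "boolean"]


def describe_with_booleans(df):
    """Cast every boolean column of `df` to integer, then return df.describe().

    Hypothetical helper for illustration; True becomes 1 and False becomes 0.
    """
    # Imported here so boolean_columns() can be used without Spark installed
    from pyspark.sql.functions import col

    for name in boolean_columns(df.dtypes):
        df = df.withColumn(name, col(name).cast("integer"))
    return df.describe()
```

Assuming `my_data` is the DataFrame from the question, `describe_with_booleans(my_data.select('Isball', 'Isboundary', 'Runs')).show()` should then list all three columns, provided the first two are boolean. After the cast, the mean of such a column is the fraction of rows where the flag is true.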