Tags: sql, pyspark, apache-spark-sql, sql-limit

DESCRIBE with LIMIT in spark sql


I am using the DESCRIBE keyword to get column information about a temp view. It's useful, but I have a table for which I'd like to describe only a subset of the columns. I'm trying to use LIMIT in conjunction with DESCRIBE to achieve this, but I cannot figure it out.

Here is a toy dataset (created with pyspark):

# make some test data
columns = ['id', 'dogs', 'cats', 'horses', 'people']
vals = [
     (1, 2, 0, 4, 3),
     (2, 0, 1, 2, 4)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView('df')

And now describe with sql:

%%sql

DESCRIBE df

Output:

col_name    data_type
id          bigint
dogs        bigint
cats        bigint
horses      bigint
people      bigint

In reality I have many more columns than this, and what I'd like to do is LIMIT the output of this query. Here are a couple of things I've tried:

Attempt #1:

DESCRIBE df
LIMIT 3

Error:

An error was encountered:
"\nextraneous input '3' expecting {<EOF>, '.'}(line 3, pos 6)\n\n== SQL ==\n\nDESCRIBE df\nLIMIT 3 \n------^^^\n"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
    raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: "\nextraneous input '3' expecting {<EOF>, '.'}(line 3, pos 6)\n\n== SQL ==\n\nDESCRIBE df\nLIMIT 3 \n------^^^\n"

Attempt #2:

SELECT a.*
FROM (
    DESCRIBE df
) AS a
LIMIT 3

Error:

An error was encountered:
'Table or view not found: DESCRIBE; line 4 pos 4'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Table or view not found: DESCRIBE; line 4 pos 4'

Does anyone know if it's possible to limit the output of describe?


Solution

  • LIMIT can't be appended directly because in this Spark version DESCRIBE is parsed as a standalone command, not a query. However, you can limit its output using pyspark.sql.DataFrame.limit(). Run the DESCRIBE statement through SparkSession.sql(); this returns the result as a DataFrame, on which you can then call limit():

    df.createOrReplaceTempView('df')
    spark.sql('DESCRIBE df').limit(3).show()
    #+--------+---------+-------+
    #|col_name|data_type|comment|
    #+--------+---------+-------+
    #|      id|   bigint|   null|
    #|    dogs|   bigint|   null|
    #|    cats|   bigint|   null|
    #+--------+---------+-------+
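
    Since the DESCRIBE result comes back as an ordinary DataFrame, you can also filter it by column name rather than by position. A small sketch, reusing the `spark` session and the `df` temp view from the question:

    ```python
    # DESCRIBE returns a regular DataFrame, so any DataFrame method applies,
    # e.g. filtering the rows down to specific columns of interest:
    spark.sql('DESCRIBE df').filter("col_name IN ('dogs', 'cats')").show()
    #+--------+---------+-------+
    #|col_name|data_type|comment|
    #+--------+---------+-------+
    #|    dogs|   bigint|   null|
    #|    cats|   bigint|   null|
    #+--------+---------+-------+
    ```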
    

    However, if you're just looking for the data types for the columns, you can use the dtypes attribute of the DataFrame:

    df.dtypes
    #[('id', 'bigint'),
    # ('dogs', 'bigint'),
    # ('cats', 'bigint'),
    # ('horses', 'bigint'),
    # ('people', 'bigint')]
    

    This is a list of tuples, which can be sliced according to what you want:

    df.dtypes[0:3]
    #[('id', 'bigint'), ('dogs', 'bigint'), ('cats', 'bigint')]
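
    Because df.dtypes is a plain Python list of (name, type) tuples, wrapping it in dict() gives direct lookup of a single column's type by name. A sketch using the values from the toy DataFrame:

    ```python
    # dict(df.dtypes) maps column name -> type string; with the toy
    # DataFrame above it is equivalent to:
    types_by_name = dict([('id', 'bigint'), ('dogs', 'bigint'), ('cats', 'bigint'),
                          ('horses', 'bigint'), ('people', 'bigint')])
    types_by_name['dogs']
    #'bigint'
    ```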
    

    There is also a describe() method for DataFrames that returns summary statistics:

    df.describe().show()
    #+-------+------------------+------------------+------------------+------------------+------------------+
    #|summary|                id|              dogs|              cats|            horses|            people|
    #+-------+------------------+------------------+------------------+------------------+------------------+
    #|  count|                 2|                 2|                 2|                 2|                 2|
    #|   mean|               1.5|               1.0|               0.5|               3.0|               3.5|
    #| stddev|0.7071067811865476|1.4142135623730951|0.7071067811865476|1.4142135623730951|0.7071067811865476|
    #|    min|                 1|                 0|                 0|                 2|                 3|
    #|    max|                 2|                 2|                 1|                 4|                 4|
    #+-------+------------------+------------------+------------------+------------------+------------------+
    

    If you wanted to limit the columns, you could use select() and specify a slice of df.columns:

    df.select(df.columns[0:3]).describe().show()
    #+-------+------------------+------------------+------------------+
    #|summary|                id|              dogs|              cats|
    #+-------+------------------+------------------+------------------+
    #|  count|                 2|                 2|                 2|
    #|   mean|               1.5|               1.0|               0.5|
    #| stddev|0.7071067811865476|1.4142135623730951|0.7071067811865476|
    #|    min|                 1|                 0|                 0|
    #|    max|                 2|                 2|                 1|
    #+-------+------------------+------------------+------------------+
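
    Alternatively, describe() accepts column names directly (its signature is describe(*cols)), so the select() step can be folded in. A sketch, passing the names as separate arguments:

    ```python
    # passing column names straight to describe() restricts the summary
    # to just those columns, same result as select(...).describe()
    df.describe('id', 'dogs', 'cats').show()
    ```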