I am using the DESCRIBE
keyword to get column information about a temp view. It's a useful method however I have a table for which I'd only like to describe a subset of the columns. I'm trying to use LIMIT
in conjunction with DESCRIBE
to achieve this but can not figure it out.
Here is a toy dataset (created with pyspark):
# make some test data
columns = ['id', 'dogs', 'cats', 'horses', 'people']
vals = [
(1, 2, 0, 4, 3),
(2, 0, 1, 2, 4)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView('df')
And now describe with sql:
%%sql
DESCRIBE df
Output:
col_name data_type
id bigint
dogs bigint
cats bigint
horses bigint
people bigint
In reality I have many more columns than this and what I'd like to do is LIMIT
the output of this query. Here are a couple of things I've tried:
Attempt #1:
DESCRIBE df
LIMIT 3
Error:
An error was encountered:
"\nextraneous input '3' expecting {<EOF>, '.'}(line 3, pos 6)\n\n== SQL ==\n\nDESCRIBE df\nLIMIT 3 \n------^^^\n"
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 603, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: "\nextraneous input '3' expecting {<EOF>, '.'}(line 3, pos 6)\n\n== SQL ==\n\nDESCRIBE df\nLIMIT 3 \n------^^^\n"
Attempt #2:
SELECT a.*
FROM (
DESCRIBE df
) AS a
LIMIT 3
Error:
An error was encountered:
'Table or view not found: DESCRIBE; line 4 pos 4'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 603, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Table or view not found: DESCRIBE; line 4 pos 4'
Does anyone know if it's possible to limit the output of describe?
Here is a way to limit the output of DESCRIBE
using pyspark.sql.dataframe.limit()
. Use pyspark.sql.context.sql()
to run the DESCRIBE
query. This returns the results as a DataFrame and you can call limit()
:
df.registerTempTable('df')
spark.sql('DESCRIBE df').limit(3).show()
#+--------+---------+-------+
#|col_name|data_type|comment|
#+--------+---------+-------+
#| id| bigint| null|
#| dogs| bigint| null|
#| cats| bigint| null|
#+--------+---------+-------+
However, if you're just looking for the data types for the columns, you can use the dtypes
attribute of the DataFrame:
df.dtypes
#[('id', 'bigint'),
# ('dogs', 'bigint'),
# ('cats', 'bigint'),
# ('horses', 'bigint'),
# ('people', 'bigint')]
This is a list of tuples, which can be sliced according to what you want:
df.dtypes[0:3]
#[('id', 'bigint'), ('dogs', 'bigint'), ('cats', 'bigint')]
There is also a describe()
method for DataFrames that returns summary statistics:
df.describe().show()
#+-------+------------------+------------------+------------------+------------------+------------------+
#|summary| id| dogs| cats| horses| people|
#+-------+------------------+------------------+------------------+------------------+------------------+
#| count| 2| 2| 2| 2| 2|
#| mean| 1.5| 1.0| 0.5| 3.0| 3.5|
#| stddev|0.7071067811865476|1.4142135623730951|0.7071067811865476|1.4142135623730951|0.7071067811865476|
#| min| 1| 0| 0| 2| 3|
#| max| 2| 2| 1| 4| 4|
#+-------+------------------+------------------+------------------+------------------+------------------+
If you wanted to limit the columns, you could use select()
and specify a slice of df.columns
:
df.select(df.columns[0:3]).describe().show()
#+-------+------------------+------------------+------------------+
#|summary| id| dogs| cats|
#+-------+------------------+------------------+------------------+
#| count| 2| 2| 2|
#| mean| 1.5| 1.0| 0.5|
#| stddev|0.7071067811865476|1.4142135623730951|0.7071067811865476|
#| min| 1| 0| 0|
#| max| 2| 2| 1|
#+-------+------------------+------------------+------------------+