I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table
and receive the following error:
AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;
AttachDistributedSequence[__index_level_0__#767L, _c0#734, carat#735, cut#736, color#737, clarity#738, depth#739, table#740, price#741, x#742, y#743, z#744] Index: __index_level_0__#767L
+- SubqueryAlias spark_catalog.default.diamonds
+- Relation hive_metastore.default.diamonds[_c0#734,carat#735,cut#736,color#737,clarity#738,depth#739,table#740,price#741,x#742,y#743,z#744] csv
The table was created following the Databricks Quick Start notebook:
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
I'm trying to read the table with
import pyspark.pandas as ps
psdf = ps.read_table("hive_metastore.default.diamonds")
and get the error above.
Reading the table into a pyspark.sql.DataFrame works fine with
df = spark.read.table("hive_metastore.default.diamonds")
The cluster versions are
Databricks Runtime Version 11.2
Apache Spark 3.3.0
Scala 2.12
I'm familiar with pandas already and would like to use pyspark.pandas.DataFrame
since I assume it will have a familiar API and be quick for me to learn and use.
Is there a way to read the table into a pyspark.pandas.DataFrame, or should I stick with pyspark.sql.DataFrame and use that? If so, why?

The AttachDistributedSequence is a special extension used by Pandas on Spark to create a distributed index. Right now it's not supported on the Shared clusters enabled for Unity Catalog due to the restricted set of operations enabled on such clusters. The workaround is to read the table using the Spark API and then use the pandas_api function (doc) to convert it into a Pandas on Spark DataFrame (in Spark 3.2.x it was called to_pandas_on_spark (doc)):

pdf = spark.read.table("abc").pandas_api()
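Applied to the table from the question, a minimal sketch might look like this (the head and value_counts calls are only illustrations of the pandas-style API, not part of the original answer):

# Read with the Spark API, then convert to a Pandas on Spark DataFrame
sdf = spark.read.table("hive_metastore.default.diamonds")
psdf = sdf.pandas_api()
print(psdf.head())
print(psdf["cut"].value_counts())

Both pandas_api and ps.read_table also accept an index_col parameter. If an existing column can serve as the index, Spark doesn't have to generate the default distributed index, which should sidestep the AttachDistributedSequence step entirely (worth verifying on your cluster):

# Assumes the csv's first column _c0 is usable as an index
psdf = sdf.pandas_api(index_col="_c0")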
P.S. It's not recommended to use .toPandas
as it will pull all data to the driver node.
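If a plain pandas DataFrame is genuinely needed (say, for a plotting library), a common pattern is to reduce the data with Spark first so only a small result is collected. A sketch, assuming a per-cut row count is all you need:

# Aggregate on the cluster, then collect only the small result
counts = spark.read.table("hive_metastore.default.diamonds").groupBy("cut").count()
pdf = counts.toPandas()  # safe here: only a handful of rows reach the driver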