Tags: python, pyspark, databricks, pyspark-pandas, databricks-unity-catalog

AttachDistributedSequence is not supported in Unity Catalog


I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error:

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;
AttachDistributedSequence[__index_level_0__#767L, _c0#734, carat#735, cut#736, color#737, clarity#738, depth#739, table#740, price#741, x#742, y#743, z#744] Index: __index_level_0__#767L
+- SubqueryAlias spark_catalog.default.diamonds
   +- Relation hive_metastore.default.diamonds[_c0#734,carat#735,cut#736,color#737,clarity#738,depth#739,table#740,price#741,x#742,y#743,z#744] csv

The table was created following the Databricks Quick Start notebook:

DROP TABLE IF EXISTS diamonds;
 
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

I'm trying to read the table with

import pyspark.pandas as ps
psdf = ps.read_table("hive_metastore.default.diamonds")

and get the error above.

Reading the table into a pyspark.sql.DataFrame works fine with

df = spark.read.table("hive_metastore.default.diamonds")

The cluster versions are

Databricks Runtime Version 11.2
Apache Spark 3.3.0
Scala 2.12

I'm already familiar with pandas and would like to use pyspark.pandas.DataFrame, since I assume it will have a familiar API and be quick for me to learn and use.

The questions I have:

  • What does the error mean?
  • What can I do to read the tables to pyspark.pandas.DataFrame?
  • Alternatively, should I just learn pyspark.sql.DataFrame and use that? If so, why?

Solution

  • AttachDistributedSequence is a special extension used by Pandas on Spark to create a distributed index. Right now it's not supported on Shared clusters enabled for Unity Catalog due to the restricted set of operations available on such clusters. The workarounds are:

    • Use a single-user Unity Catalog enabled cluster
    • Read the table using the Spark API, and then use the pandas_api function (doc) to convert it into a Pandas on Spark DataFrame; in Spark 3.2.x/3.3.x it's called to_pandas_on_spark (doc). A fuller version-aware sketch follows after this list:
    pdf = spark.read.table("abc").pandas_api()
    

    P.S. It's not recommended to use .toPandas() as it will pull all the data to the driver node.
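
For reference, here is a minimal sketch of the second workaround applied to the table from the question. The hasattr fallback is an assumption added so the same snippet works whether the conversion method on your runtime is named pandas_api or to_pandas_on_spark, and spark is the SparkSession that Databricks notebooks provide:

# Read via the Spark API, then convert to a Pandas on Spark DataFrame.
# Assumes a Databricks notebook where `spark` (SparkSession) is predefined.
sdf = spark.read.table("hive_metastore.default.diamonds")

if hasattr(sdf, "pandas_api"):
    psdf = sdf.pandas_api()           # newer Spark versions
else:
    psdf = sdf.to_pandas_on_spark()   # Spark 3.2.x / 3.3.x (e.g. DBR 11.2)

print(type(psdf))   # pyspark.pandas.frame.DataFrame
print(psdf.head())  # pandas-like API from here on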