Search code examples
sql-serverazurepysparkazure-synapse

Azure Synapse pyspark translates STRING datatype into varchar(8000) for external table


I try to load an external table in Azure Synpase using a PySpark notebook but the datatypes seem to mismatch. When I initialize the table I execute (stripped down example):

CREATE OR REPLACE TABLE LANDING_DB.Opportunity (
 AccountId varchar(255),
 Id varchar(255),
 Name STRING,
 Description STRING
)
USING DELTA LOCATION 'abfss://[email protected]/Landing/ABC/Opportunity/'

via

spark.sql()

Yet the external table uses 'varchar(8000)' as datatype for the Name and Description column. I would like to use varchar(max) as especially the Description column can have a lot of text.

Am I missing some settings or is this a bug?

The database is the serverless pool on the synapse instance.

Help would be appreciated, thank you in advance

Best, Sven


Solution

  • First of all note the tip: Try to make VARCHAR() as short as possible. Avoid VARCHAR(MAX) if possible because it can impair performance.

    It seems that it makes STRING automatically to varchar(8000) because it infers the length. See https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/sql/best-practices-serverless-sql-pool.md#check-inferred-data-types

    Instead of using STRING you can use varchar(x) and then put a higher number. Maybe you know what the maximum could be in the Description? varchar() takes an integer as the argument, and the max value of an integer is 2,147,483,647. I don't think your description can get that long? So better put a lower number which will also not be reached.