Tags: azure, azure-databricks, databricks-connect, databricks-sql

Handling spaces in an abfss path using COPY INTO with Azure Databricks


I am trying to use the COPY INTO statement in Databricks to ingest CSV files from Cloud Storage.

The problem is that the folder name contains a space (/AP Posted/), and when I refer to that path, the code execution raises the error below:

Error in SQL statement: URISyntaxException: Illegal character in path at index 70: abfss://[email protected]/RAW/AP Posted/

I googled the error and found articles advising to replace the space with "%20", but that did not work.

So, does anyone know how to solve this? Or is the only solution to avoid spaces in folder names?

This is my current Databricks SQL Code:

COPY INTO prod_gbs_gpdi.bronze_data.my_table
FROM 'abfss://[email protected]/RAW/AP Posted/'
FILEFORMAT = CSV
VALIDATE 500 ROWS
PATTERN = 'AP_SAPEX_KPI_001 - Posted Invoices in 2021_3.CSV'
FORMAT_OPTIONS(
    'header'='true', 
    'delimiter'=';', 
    'skipRows'='8', 
    'mergeSchema'='true',  --Whether to infer the schema across multiple files and to merge the schema of each file
    'encoding'='UTF-8', 
    'enforceSchema'='true',  --Whether to forcibly apply the specified or inferred schema to the CSV files
    'ignoreLeadingWhiteSpace'='true',
    'ignoreTrailingWhiteSpace'='true',
    'mode'='PERMISSIVE'  --Parser mode around handling malformed records
)
COPY_OPTIONS (
    'force' = 'true',  --If set to true, idempotency is disabled and files are loaded regardless of whether they’ve been loaded before.
    'mergeSchema'= 'true'  --If set to true, the schema can be evolved according to the incoming data.
)
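For reference, the URISyntaxException is raised because a raw space is not a legal character in a URI path; percent-encoding turns it into a syntactically valid URI. The round trip can be checked with Python's standard urllib (this only illustrates why the raw path fails URI parsing; as noted above, the encoded form still may not resolve in COPY INTO):

```python
from urllib.parse import quote, unquote

# Percent-encode the folder name containing a space ('/' is kept as a separator)
path = "/RAW/AP Posted/"
encoded = quote(path)  # '/' is in the default 'safe' set, so only the space changes
print(encoded)                   # /RAW/AP%20Posted/
print(unquote(encoded) == path)  # decodes back to the original path
```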

Solution

  • Using a path in which one of the folders has a space gives the same error.


    • To overcome this, you can specify the folder in the PATTERN parameter as follows:
    %sql
    COPY INTO table1
    FROM '/mnt/repro/op/'
    FILEFORMAT = csv
    PATTERN='has space/sample1.csv'
    FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
    COPY_OPTIONS ('mergeSchema' = 'true');
    


    • Alternatively, giving the path as path/has?space/ also works, because ? is a glob wildcard that matches any single character. But if there are multiple sibling folders such as has space, hasAspace, hasBspace, etc., this would match all of them and not work as expected.
    %sql
    COPY INTO table2
    FROM '/mnt/repro/op/has?space/'
    FILEFORMAT = csv
    PATTERN='sample1.csv'
    FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
    COPY_OPTIONS ('mergeSchema' = 'true');
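    The caveat above follows directly from ? being a single-character wildcard. A quick illustration with Python's fnmatch (a local stand-in; Databricks' path globbing is assumed to treat ? the same way):

    ```python
    from fnmatch import fnmatch

    # '?' matches exactly one character, so the folder with a space matches...
    print(fnmatch("has space", "has?space"))   # True
    # ...but so does any sibling folder that differs only in that one position:
    print(fnmatch("hasAspace", "has?space"))   # True
    print(fnmatch("hasBspace", "has?space"))   # True
    ```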
    


    • Another alternative is to copy the file to DBFS using dbutils.fs.cp() and then use the DBFS path with COPY INTO.
    dbutils.fs.cp('/mnt/repro/op/has space/sample1.csv','/FileStore/tables/mycsv.csv')
    
    %sql
    COPY INTO table3
    FROM '/FileStore/tables/'
    FILEFORMAT = csv
    PATTERN='mycsv.csv'
    FORMAT_OPTIONS ('mergeSchema' = 'true','header'='true')
    COPY_OPTIONS ('mergeSchema' = 'true');
    

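    dbutils.fs.cp() is Databricks-only, but the underlying idea, staging the file under a space-free path before loading, can be sketched with plain Python (shutil here is a local stand-in, not what runs against DBFS):

    ```python
    import shutil
    import tempfile
    from pathlib import Path

    # Set up a source file inside a folder whose name contains a space
    root = Path(tempfile.mkdtemp())
    src_dir = root / "has space"
    src_dir.mkdir()
    src = src_dir / "sample1.csv"
    src.write_text("id;name\n1;foo\n")

    # Copy it to a space-free staging path, mirroring dbutils.fs.cp(...)
    staging = root / "staging"
    staging.mkdir()
    dst = staging / "mycsv.csv"
    shutil.copy(src, dst)
    print(dst.read_text() == src.read_text())  # contents survive the copy
    ```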