apache-spark-sql, databricks, azure-databricks, databricks-sql

How to read csv files in dbfs using Spark SQL only?


I am working on a Databricks interactive cluster and I want to query CSV files stored in ADLS or DBFS using SQL.

Issue:

I can easily do that for Delta tables and Parquet files. However, when I do this for CSV files with specific delimiters, all the columns get loaded as one column.

%sql SELECT *
from delta.`dbfs:/path/to/delta/table`

^^ displays output correctly

%sql SELECT *
from csv.`dbfs:/path/to/file.csv`

^^ displays the output as a single column and without headers. It needs options to specify the separator (sep) and header.

Potential Solutions:

I can easily do this in PySpark with spark.read.csv("dbfs:/path/to/file.csv", header=True, sep="|"), but how can I do this in Spark SQL (the %sql magic command)?

I can create a table specifying options

%sql CREATE TABLE schema.table
USING CSV
OPTIONS (path "dbfs:/path/to/file.csv", header "true", delimiter "|")

and then I can query the data.
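
A lighter-weight variant is a temporary view, which lives only for the current session and leaves nothing behind in the metastore. A minimal sketch with the same options (the view name is illustrative):

%sql CREATE TEMPORARY VIEW csv_view
USING CSV
OPTIONS (path "dbfs:/path/to/file.csv", header "true", delimiter "|")

After that, SELECT * FROM csv_view works like any other table query.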

But is there a way to just SELECT * over the CSV data without creating these intermediate tables or views?


Solution

  • Yes, Databricks supports reading CSV files with options directly in SQL, but not with the syntax shown below.

    SELECT
      *
    FROM
      CSV.`/databricks-datasets/wine-quality/winequality-red.csv`
       OPTIONS (
            header='true',
            delimiter=','
          )
    

    (Screenshot: the query above fails with a syntax error.)


    Instead, Databricks provides this capability in slightly different syntax, via the table-valued function read_files:

    SELECT
      *
    FROM
      read_files(
        '/databricks-datasets/wine-quality/winequality-red.csv',
        header => true,
        sep => ';'
      )
    

    (Screenshot: the query returns the CSV data split into its proper columns.)
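
    Since read_files behaves like an ordinary table reference, you can filter or aggregate over it directly. A small usage sketch, assuming the dataset's header provides the quality and alcohol columns:

    SELECT
      quality,
      ROUND(AVG(alcohol), 2) AS avg_alcohol
    FROM
      read_files(
        '/databricks-datasets/wine-quality/winequality-red.csv',
        header => true,
        sep => ';'
      )
    GROUP BY quality
    ORDER BY quality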


    You can go through the official documentation for more detail. As of now, read_files supports reading JSON, CSV, XML, TEXT, BINARYFILE, PARQUET, AVRO, and ORC file formats, and it is available in Databricks Runtime 13.1 and above.
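
    If the path does not end in a recognizable extension, you can name the format explicitly with the format option (the path below is hypothetical):

    SELECT
      *
    FROM
      read_files(
        '/mnt/raw/events',  -- hypothetical directory of JSON files
        format => 'json'
      )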