I am working on a Databricks interactive cluster and I want to query CSV files stored in ADLS or DBFS using SQL.
Issue:
I can easily do that for Delta tables and Parquet files. However, when I do this for a CSV with a specific delimiter, all the columns are loaded as one column.
%sql SELECT *
FROM delta.`dbfs:/path/to/delta/table`
^^ displays output correctly
%sql SELECT *
FROM csv.`dbfs:/path/to/file.csv`
^^ displays output as one column without headers. Basically, it needs some options passed to specify the separator and the header.
Potential Solutions:
I can easily do this in PySpark: spark.read.csv("dbfs:/path/to/file.csv", header=True, sep="|")
but how can I do this in Spark SQL (the %sql magic command)?
I can also create a table specifying the options:
%sql CREATE TABLE schema.table
USING CSV
OPTIONS (path "dbfs:/path/to/file.csv", header "true", delimiter "|")
and then I can query the data.
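A temporary view is a lighter-weight variant of the same approach: it avoids a permanent metastore entry, though it still requires a DDL step. A minimal sketch, assuming the same pipe-delimited file (my_csv_view is just a placeholder name):
%sql CREATE OR REPLACE TEMPORARY VIEW my_csv_view -- placeholder view name
USING CSV
OPTIONS (path "dbfs:/path/to/file.csv", header "true", delimiter "|")
%sql SELECT * FROM my_csv_view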
But is there a way to just SELECT * over the CSV data without creating these intermediate tables or views?
Yes, Databricks supports reading the CSV file format using SQL with options directly, but NOT with the syntax shown below:
SELECT *
FROM csv.`/databricks-datasets/wine-quality/winequality-red.csv`
OPTIONS (
  header = 'true',
  delimiter = ','
)
Instead, it provides the solution with a slightly different syntax, using the table-valued function read_files:
SELECT *
FROM read_files(
  '/databricks-datasets/wine-quality/winequality-red.csv',
  header => true,
  sep => ';'
)
You can go through the official documentation for more detail. As of now, this function supports reading JSON, CSV, XML, TEXT, BINARYFILE, PARQUET, AVRO, and ORC file formats, and it is available in Databricks Runtime 13.1 and above.
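Applied to the pipe-delimited file from the question, the same pattern would look roughly like this (a sketch; the path is the question's placeholder):
SELECT *
FROM read_files(
  'dbfs:/path/to/file.csv',
  format => 'csv',  -- optional here, since the file extension is .csv
  header => true,
  sep => '|'
)
Since read_files infers the schema by default, no intermediate table or view is needed.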