I'm attempting to store the top 100 rows of a view from a SQL database into a DataFrame using a Synapse Notebook. I have tried using this code:
%%pyspark
df = spark.sql("SELECT DataType, TableName, ColumnName FROM dbo.ExternalTableSchema LIMIT 100")
df.show()
But I consistently get the error below even though I have confirmed the name of the view is correct.
I have also tried a variation of the code above:
%%pyspark
df = spark.sql("SELECT TOP (100) [DataType], [TableName], [ColumnName] FROM [dbo].[ExternalTableSchema]")
df.show()
But I get a different error, shown below. I am unsure how to resolve this. Any help would be greatly appreciated!
I have tried the below approach:
from pyspark.sql import SparkSession
jdbcHostname = "dilisqlsrv.database.windows.net"
jdbcDatabase = "db02"
jdbcPort = "1433"
username = "dilip"
password = "Welcome@1"
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
query = "(SELECT * FROM DepartmentView) AS tmp"
df = spark.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
df.show()
Results:
+------+---------+-----------+---------------+
|DeptID| DeptName| Location| Manager|
+------+---------+-----------+---------------+
| 1| Sales| New York| John Smith|
| 2|Marketing|Los Angeles| Jane Doe|
| 3| Finance| Chicago|Michael Johnson|
| 4| HR| Houston| Emily Brown|
+------+---------+-----------+---------------+
In the above code, I set the JDBC connection parameters and read the data into a DataFrame using a pushdown query.
Since you mentioned you need the top 100 rows, you can use a pushdown query like the one below:
query = "(SELECT TOP 100 * FROM YourTableName) AS tmp"
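Note that the subquery must be wrapped in parentheses and given an alias so SQL Server can treat it like a table name; the TOP 100 then runs server-side before Spark pulls any rows. A minimal sketch of building that string (the view name here is only a placeholder for your own):

```python
# Build a JDBC pushdown query. The parentheses and the "AS tmp" alias are
# required so the subquery can stand in for a table name in the "dbtable" option.
view_name = "dbo.ExternalTableSchema"  # placeholder; substitute your view
pushdown_query = "(SELECT TOP 100 * FROM {0}) AS tmp".format(view_name)
print(pushdown_query)
```

The resulting string is what you pass to `.option("dbtable", ...)` in the read shown above.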
Here is how you can do it using spark.sql:
tableName = "DepartmentView"
spark.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", tableName) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load() \
    .createOrReplaceTempView("temp_table")
top100_rows_df = spark.sql("SELECT * FROM temp_table LIMIT 100")
top100_rows_df.show()
Results:
+------+---------+-----------+---------------+
|DeptID| DeptName| Location| Manager|
+------+---------+-----------+---------------+
| 1| Sales| New York| John Smith|
| 2|Marketing|Los Angeles| Jane Doe|
| 3| Finance| Chicago|Michael Johnson|
| 4| HR| Houston| Emily Brown|
+------+---------+-----------+---------------+
In the above code, the Azure SQL Database view is registered as a temporary view, and then a spark.sql query reads the top 100 rows from it.