Search code examples
azure-synapseazure-synapse-analyticsazure-notebooks

Storing a View in a Dataframe in Azure Synapse Notebook


I'm attempting to store the top 100 rows of a view from an SQL database to a dataframe using Synapse Notebook. I have tried using this code:

%%pyspark
df = spark.sql("SELECT DataType, TableName, ColumnName FROM dbo.ExternalTableSchema LIMIT 100")
df.show()

But I consistently get the error below even though I have confirmed the name of the view is correct.

enter image description here

I have also tried a variation of the code above:

%%pyspark
df = spark.sql("SELECT TOP (100) [DataType], [TableName], [ColumnName] FROM [dbo].[ExternalTableSchema]")

df.show()

But I get a different error shown below. I am unsure of how to resolve this. Any help would be greatly appreciated!

enter image description here


Solution

  • I have tried the below approach:

    from pyspark.sql import SparkSession
    jdbcHostname = "dilisqlsrv.database.windows.net"
    jdbcDatabase = "db02"
    jdbcPort = "1433"
    username = "dilip"
    password = "Welcome@1"
    jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
    query = "(SELECT * FROM DepartmentView) AS tmp"
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbcUrl) \
        .option("dbtable", query) \
        .option("user", username) \
        .option("password", password) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load()
    df.show()
    

    Results:

    +------+---------+-----------+---------------+
    |DeptID| DeptName|   Location|        Manager|
    +------+---------+-----------+---------------+
    |     1|    Sales|   New York|     John Smith|
    |     2|Marketing|Los Angeles|       Jane Doe|
    |     3|  Finance|    Chicago|Michael Johnson|
    |     4|       HR|    Houston|    Emily Brown|
    +------+---------+-----------+---------------+
    

    In the above code I have set JDBC connection parameters Reading data into a DataFrame using pushdown query.

    As you mentioned you need to query top 100 rows you can use the below

    query = "(SELECT TOP 100 * FROM YourTableName) AS tmp"
    

    Here is how you can do it using spark.sql

    tableName = "DepartmentView"
    spark.read \
        .format("jdbc") \
        .option("url", jdbcUrl) \
        .option("dbtable", tableName) \
        .option("user", username) \
        .option("password", password) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .load() \
        .createOrReplaceTempView("temp_table")
        
    top100_rows_df = spark.sql("SELECT * FROM temp_table LIMIT 100")
    top100_rows_df.show()
    

    Results:

    +------+---------+-----------+---------------+
    |DeptID| DeptName|   Location|        Manager|
    +------+---------+-----------+---------------+
    |     1|    Sales|   New York|     John Smith|
    |     2|Marketing|Los Angeles|       Jane Doe|
    |     3|  Finance|    Chicago|Michael Johnson|
    |     4|       HR|    Houston|    Emily Brown|
    +------+---------+-----------+---------------+
    

    In the above code registering the Azure SQL Database table as a temporary view. Executing SQL query using spark.sql to read the top 100 rows