Tags: apache-spark-sql, databricks, azure-databricks, databricks-sql

Spark SQL WHERE clause with BETWEEN dates not returning data even though data exists for those dates


I am expecting ABC_df to display/return data in my Databricks notebook, but it returns nothing even though there is data for the period '2019-02-24' to '2019-03-31'. When I hard-code the condition as

WHERE DATE_FORMAT(E.EVENT_TIME_UTC,'yyyy-MM-dd') BETWEEN '2019-02-24' AND '2019-03-31'

the query works fine and successfully returns the data.

But I want to pass the dates as dynamic parameters, so I tried the options below; none of them return any data:

BETWEEN CAST('{IterStartLagDatetime}' AS STRING) AND CAST('{IterEndDatetime}' AS STRING)
BETWEEN to_date('{IterStartLagDatetime}','yyyy-MM-dd') AND to_date('{IterEndDatetime}','yyyy-MM-dd')
BETWEEN DATE_FORMAT('{IterStartLagDatetime}','yyyy-MM-dd') AND DATE_FORMAT('{IterEndDatetime}','yyyy-MM-dd')

The complete code is listed below. Please help me with any other possible solutions.

%python
from datetime import datetime, timedelta

# right() is not a Python built-in; defined here so the snippet runs (returns the rightmost n characters)
def right(s, n):
    return s[-n:]

BatchInsert_StartYear = 2019
BatchInsert_EndYear = 2019
while BatchInsert_StartYear <= BatchInsert_EndYear:
    print(BatchInsert_StartYear)
    MonthCount = 1
    while MonthCount < 13:
        if MonthCount < 12:
            IterEndDatetime = right('00' + str(MonthCount + 1), 2) + '-01-' + str(BatchInsert_StartYear)
            IterEndDatetime = datetime.strptime(IterEndDatetime, '%m-%d-%Y') + timedelta(days=-1)
            IterEndDatetime = IterEndDatetime.strftime("%Y-%m-%d")
            print(IterEndDatetime)
            IterStartDatetime = right('00' + str(MonthCount), 2) + '-01-' + str(BatchInsert_StartYear)
            IterStartLagDatetime = datetime.strptime(IterStartDatetime, '%m-%d-%Y') + timedelta(days=-5)
            IterStartDatetime = datetime.strptime(IterStartDatetime, '%m-%d-%Y').strftime("%Y-%m-%d")
            print(IterStartDatetime)
            IterStartLagDatetime = IterStartLagDatetime.strftime("%Y-%m-%d")
            print(IterStartLagDatetime)
        else:
            IterEndDatetime = right('00' + str(1), 2) + '-01-' + str(BatchInsert_StartYear + 1)
            IterEndDatetime = datetime.strptime(IterEndDatetime, '%m-%d-%Y') + timedelta(days=-1)
            IterEndDatetime = IterEndDatetime.strftime("%Y-%m-%d")
            print(IterEndDatetime)
            IterStartDatetime = right('00' + str(MonthCount), 2) + '-01-' + str(BatchInsert_StartYear)
            IterStartLagDatetime = datetime.strptime(IterStartDatetime, '%m-%d-%Y') + timedelta(days=-5)
            IterStartDatetime = datetime.strptime(IterStartDatetime, '%m-%d-%Y').strftime("%Y-%m-%d")
            print(IterStartDatetime)
            IterStartLagDatetime = IterStartLagDatetime.strftime("%Y-%m-%d")
            print(IterStartLagDatetime)

        # plain (non-f) string: the '{...}' placeholders are passed to Spark literally
        ABC_df = spark.sql('''
        SELECT *
        FROM Temp_HISTORICAL_SWIPE_DETAILS E
        WHERE DATE_FORMAT(E.EVENT_TIME_UTC,'yyyy-MM-dd') BETWEEN DATE_FORMAT('{IterStartLagDatetime}','yyyy-MM-dd') AND DATE_FORMAT('{IterEndDatetime}','yyyy-MM-dd')
        ''')
        ABC_df.show()

        MonthCount += 1
    BatchInsert_StartYear += 1

Solution

  • From the given code, I can see that you are trying to reference Python variables directly inside the string passed to spark.sql(). Look at the following demonstration to understand how to use variables inside the query string.

    • I have the following data in my table demo3 as an example.

    (screenshot: sample rows in the demo3 table)

    • When I run a query similar to yours, it returns one record as shown below:
    %sql
    
    select * from demo3 WHERE DATE_FORMAT(dt,'yyyy-MM-dd') BETWEEN '2019-02-24' AND '2019-03-31'
    

    (screenshot: result of the query above, returning one record)

    • Now let's say you want to use variables inside this query (you build these date variables dynamically in your loop; here I assign the values directly for demonstration).

    • If I build the query string the same way as yours, we will not get the desired result. You can store the query in a string variable, print it to check exactly what SQL is being executed, and then pass that variable to spark.sql():

    startdate='2019-02-24'
    enddate='2019-03-31'
    
    # plain string: '{startdate}' and '{enddate}' are NOT substituted here
    query = '''select * from demo3 WHERE DATE_FORMAT(dt,'yyyy-MM-dd') BETWEEN '{startdate}' AND '{enddate}' '''
    print(query)
    df = spark.sql(query)
    df.show()
    

    (screenshot: the printed query still contains the literal text '{startdate}' and '{enddate}', so the BETWEEN condition matches no rows)

    • You can use Python f-strings to reference variables inside strings, so writing the query as below gives the desired result.
    startdate='2019-02-24'
    enddate='2019-03-31'
    
    #use f"...{variable}..." to reference variables.
    query = f'''select * from demo3 WHERE DATE_FORMAT(dt,'yyyy-MM-dd') BETWEEN '{startdate}' AND '{enddate}' '''
    print(query)
    df = spark.sql(query)
    df.show()
    

    (screenshot: the query now returns the expected record)
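
    • Applying the same idea to the query from the question, a minimal sketch (assuming the same table and column names, and that IterStartLagDatetime and IterEndDatetime already hold the 'yyyy-MM-dd' strings computed in your loop) would look like this:

    # f-string so the date variables are substituted into the SQL text
    query = f'''
    SELECT *
    FROM Temp_HISTORICAL_SWIPE_DETAILS E
    WHERE DATE_FORMAT(E.EVENT_TIME_UTC,'yyyy-MM-dd')
          BETWEEN '{IterStartLagDatetime}' AND '{IterEndDatetime}'
    '''
    print(query)  # confirm the dates were substituted before executing
    ABC_df = spark.sql(query)
    ABC_df.show()

    • On Spark 3.4 and later you can also avoid string interpolation entirely by passing the values as named parameters to spark.sql(query, args={...}) with :name markers in the SQL; the f-string approach above works on any version.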