Search code examples
azurepysparkazure-synapseazure-synapse-pipeline

Creating an array of quarterly dates in azure synapse pipeline


I am working in the Azure Synapse Pipeline environment and I am trying to initialize an array that will contain the dates "03-31", "06-30", "09-30" and "12-31" for dates within the range of 2018 to the current year. So currently, it would be all quarterly dates through "2024-03-31". I am building this out to do a rest API call where I will pass each of the dates in the array to pull reports to my ADLS. I already have the ADLS piece figured out, its just a matter of actually getting this array to work.

I have tried several solutions to get this working, but my experience in synapse is quite limited. I started by trying to create a pyspark notebook that would be ran at the start of the pipeline and generate a .json with all the necessary quarterly dates in the ADLS. I was able to get this to work, but then, when I pulled the .json into the pipeline, I couldn't convert it to an array variable.

I have settled with generating the quarterly dates array manually for the time being. But I can't help to expect there is a simple approach to solve this issue. It doesn't seem like it should be that complicated.

def getDateRange(start_date):
# Define start date and current date
current_date_str = spark.sql("SELECT CAST(current_date() AS STRING)").collect()[0][0]

print("Current Date:")
print(current_date_str)

end_year = int(current_date_str[:4]) + 1

# Generate a sequence of dates within the range up to the current date

date_range_df = spark.range(0, (365 * (end_year - 2018) + 1)).selectExpr("date_add('{}', CAST(id AS INT)) as date".format(start_date))

# Filter out dates beyond the current date
date_range_df = date_range_df.filter(col("date") <= current_date())

# Extract month and day
date_range_df = date_range_df.withColumn("month", expr("month(date)")).withColumn("day", expr("day(date)"))

# Filter dates to keep only the end of each quarter
quarterly_dates_df = date_range_df.filter(
    (date_range_df.month == 3) & (date_range_df.day == 31) |
    (date_range_df.month == 6) & (date_range_df.day == 30) |
    (date_range_df.month == 9) & (date_range_df.day == 30) |
    (date_range_df.month == 12) & (date_range_df.day == 31)
)

# Format dates as strings in 'MM-dd-yyyy' format
quarterly_dates_df = quarterly_dates_df.withColumn("date_string", date_format("date", "MM-dd-yyyy"))

return quarterly_dates_df

There is an additional piece of code to get this data into the ADLS system, but it contains sensitive info. I don't have the pipeline piece I had built out to share, but I was trying to do a lookup, then store the output to an array, but the array was storing as a single element instead of as a list.


Solution

  • @json() in azure synapse pipeline will convert the string value into Json itself.

    I tried to give string value '[1,2,3] as an input to @json() expression, it gave the array value as output [1,2,3].

    Input:

    Activity output:

    Similarly, you can give the output of lookup activity in the json function to convert the string output of lookup activity to array of Json itself.

    Sample expression: @json(activity('Lookup1').output.value)