
Appending to a Spark DataFrame iteratively using PySpark in Databricks


I have a list of header keys that I need to iterate through, fetching data from an API for each one. I create a temporary DataFrame to hold each API response and use union to append it to the final DataFrame. This code works, but it is very slow. Please help me find a more efficient solution.

# df_final was created as an empty DataFrame before the for loop

list1 = [<list of dicts of header data>]

for i in range(0, len(list1)):
    api_header_data = list1[i]['header']

    # Call the API function

    input_data = get_api_function(api_header_data)
    response = postrequest(input_data)
    columns = response.json()["result"]["Headers"]
    data = response.json()["result"]["Data"]
    # Create a temp DataFrame and union it onto the final DataFrame
    df_temp = spark.createDataFrame(data, columns)
    df_final = df_final.union(df_temp)

Solution

  • Each union appends another node to df_final's logical plan, so the plan grows with every iteration and gets slower to analyze. Instead, collect all the data first and create the DataFrame once:

    # No need to pre-create an empty df_final; it is built once after the loop
    
    list1 = [<list of dicts of header data>]
    
    all_data = []
    
    for headers in list1:
        api_header_data = headers['header']
        
        # Call the API function

        input_data = get_api_function(api_header_data)
        response = postrequest(input_data)
        columns = response.json()["result"]["Headers"]  # assumes every response returns the same headers
        data = response.json()["result"]["Data"]
        all_data.extend(data)
    
    if all_data:
        df_final = spark.createDataFrame(all_data, columns)
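
  • If most of the time is actually spent waiting on the API rather than in Spark, you can also run the requests concurrently. A minimal sketch, assuming the get_api_function and postrequest helpers from the question and that every response returns the same Headers:

    from concurrent.futures import ThreadPoolExecutor

    def fetch_rows(headers):
        # Same per-item work as the loop body above
        input_data = get_api_function(headers['header'])
        response = postrequest(input_data)
        result = response.json()["result"]
        return result["Headers"], result["Data"]

    # The calls are I/O-bound, so a thread pool overlaps the waiting
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch_rows, list1))

    # Flatten the per-response rows and build the DataFrame once
    all_data = [row for _, rows in results for row in rows]

    if all_data:
        columns = results[0][0]  # assumes identical headers across responses
        df_final = spark.createDataFrame(all_data, columns)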