Tags: python, apache-spark, pyspark

Dynamically derive dataframe names for assignment


I have an array of columns. For each column in the array, I need to perform the same operation and store the result in a separate dataframe, so that these dataframes can later be merged or joined into one.

With the code below, I get the error: cannot assign to expression here. Maybe you meant '==' instead of '='?

I understand that the error occurs because the assignment target, c +'_df_name', is an expression that derives the dataframe name dynamically.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("test").getOrCreate()

data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

attributes = ["name","skill"]

for c in attributes:
    c +'_df_name' = df.select(lit('xyz')) # simplified

So I changed the code as follows:

for c in attributes:
    df_name = c +'_df_name'
    print(df_name) # prints name_df_name, skill_df_name
    df_name = df.select(lit('xyz')) # simplified

Now, if I try to print the new dataframes using the code below, I get the error: name 'name_df_name' is not defined.

name_df_name.show()
skill_df_name.show()

How do we dynamically generate unique dataframe names for each assignment during the iteration?


Solution

  • You are trying to assign a dataframe to a string expression instead of storing it in a container such as a dictionary.

    data_frames = {f"{c}_df_name": df.select(lit('xyz')) for c in attributes}
    

    By using a dictionary, the list of dataframe names is accessible using the keys method.

    data_frames.keys()
    

    The individual dataframes can be accessed using the appropriate key string. e.g.

    data_frames["name_df_name"]
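The dictionary pattern also covers the later merge step mentioned in the question. The following is a minimal sketch, not a definitive implementation: build_frames and merge_frames are hypothetical helper names, and plain Python lists stand in for dataframes so that the example runs without a Spark session. With PySpark, the factory could be something like lambda c: df.select(lit('xyz')), and the combiner would be whichever join or union fits your data.

```python
from functools import reduce


def build_frames(attributes, make_frame):
    """Return a dict mapping '<attribute>_df_name' to the object built for it."""
    return {f"{c}_df_name": make_frame(c) for c in attributes}


def merge_frames(frames, combine):
    """Fold all values of the dict into one object with a two-argument combiner."""
    return reduce(combine, frames.values())


attributes = ["name", "skill"]

# Stand-in factory (assumption for illustration): a one-row "frame" per attribute.
frames = build_frames(attributes, lambda c: [(c, "xyz")])

# Names are ordinary dictionary keys, so they can be derived at runtime.
frame_names = list(frames)
name_frame = frames["name_df_name"]

# Stand-in combiner: list concatenation plays the role of a union or join.
merged = merge_frames(frames, lambda left, right: left + right)
```

Keeping the frames in one dictionary means the merge step can iterate over them without knowing any of the derived names in advance.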
    

    Further Explanation:

    attributes = ["name","skill"]
    
    for c in attributes:
        c +'_df_name' = df.select(lit('xyz')) # simplified
    

    On each iteration, c +'_df_name' evaluates to a string: "name_df_name", then "skill_df_name". A string value is not a variable name, and Python does not allow an expression or a literal on the left-hand side of an assignment, so the assignment within the loop is a syntax error.

    You cannot do this:

    "name_df_name" = 1
    

    But you can do this:

    name_df_name = 1
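The rule about assignment targets can be checked without Spark by asking Python to compile each statement. This is a small sketch; is_valid_statement is a hypothetical helper name, and the exact wording of the SyntaxError message varies between Python versions, so the helper only reports whether compilation succeeded.

```python
def is_valid_statement(source):
    """Return True if the source compiles, False if it is a syntax error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# An expression or a string literal cannot be an assignment target.
expression_target = is_valid_statement("c + '_df_name' = 1")
literal_target = is_valid_statement("'name_df_name' = 1")

# A bare name can be a target, and so can a subscript such as a dictionary key.
name_target = is_valid_statement("name_df_name = 1")
subscript_target = is_valid_statement("frames[c + '_df_name'] = 1")
```

The last case is why a dictionary works: the derived string is used as a key inside a subscript, which is a valid target, rather than as the target itself.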
    

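    If you really want variables named after these strings, you could add them to the locals() dictionary, but this is a dirty hack and is not recommended. It only works at module level, where locals() is the same dictionary as globals(); inside a function, writing to locals() does not create a variable.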

    attributes = ["name","skill"]
    
    for c in attributes:
        var_name = c +'_df_name'
        if var_name not in locals():
            locals()[var_name] = df.select(lit('xyz')) # simplified
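One more reason to avoid the locals() approach is that it stops working as soon as the loop is moved into a function. The sketch below illustrates this with a plain string in place of a dataframe; assign_via_locals and assign_via_dict are hypothetical names used only for this comparison.

```python
def assign_via_locals():
    # Writing to the mapping returned by locals() inside a function
    # does not create a real local variable.
    locals()["name_df_name"] = "xyz"
    try:
        # The name is compiled as a global lookup, and no such global exists.
        return name_df_name
    except NameError:
        return None


def assign_via_dict():
    # A dictionary behaves the same way at module level and inside functions.
    frames = {}
    frames["name_df_name"] = "xyz"
    return frames["name_df_name"]


locals_result = assign_via_locals()  # None: the variable was never created
dict_result = assign_via_dict()      # "xyz"
```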