Dynamically derive dataframe names for assignment

I have an array of columns. For each column in the array, I need to perform the same operation and store the result in a separate dataframe, so that these dataframes can later be merged/joined into one.

With the code below, I get the error: cannot assign to expression here. Maybe you meant '==' instead of '='?

I understand that the error occurs because we are trying to derive the dataframe name dynamically with c +'_df_name' and then assign data to it.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("test").getOrCreate()

data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

attributes = ["name","skill"]

for c in attributes:
    c +'_df_name' = df.select(lit('xyz')) # simplified

So I changed the code as follows:

for c in attributes:
    df_name = c +'_df_name'
    print(df_name) # prints name_df_name, skill_df_name
    df_name = df.select(lit('xyz')) # simplified

Now if I try to print the new dataframes using the code below, I get the error: name 'name_df_name' is not defined.

name_df_name.show()
skill_df_name.show()

How do we dynamically generate unique dataframe names for each assignment during the iteration?

Solution

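You are trying to assign a dataframe to an expression that produces a string, not to a variable name. Store the dataframes in a container such as a dictionary instead, keyed by the generated name.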

data_frames = {f"{c}_df_name": df.select(lit('xyz')) for c in attributes}

With a dictionary, the dataframe names are available through the keys() method.

data_frames.keys()

The individual dataframes can be accessed using the appropriate key string, e.g.

data_frames["name_df_name"]
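Since the goal in the question is to merge the dataframes later, here is a minimal sketch of how a dictionary lends itself to that. To keep it runnable without Spark, plain lists of row dicts stand in for dataframes; the helper names build_named_results and combine_all are hypothetical, not part of PySpark. With real dataframes you would presumably pass something like lambda c: df.select(lit('xyz')) as the builder and lambda a, b: a.unionByName(b) (or a join) as the combiner.

```python
from functools import reduce


def build_named_results(names, make_result):
    """Hypothetical helper: map "<name>_df_name" to make_result(name)."""
    return {f"{name}_df_name": make_result(name) for name in names}


def combine_all(results, combine):
    """Hypothetical helper: fold all dictionary values into one with `combine`."""
    return reduce(combine, results.values())


attributes = ["name", "skill"]

# Stand-in for df.select(...): each "dataframe" is a list of row dicts.
frames = build_named_results(
    attributes, lambda c: [{"attribute": c, "value": "xyz"}]
)

# Stand-in for a union of dataframes: concatenate the row lists.
merged = combine_all(frames, lambda left, right: left + right)

for frame_name, rows in frames.items():
    print(frame_name, rows)
print(merged)
```

Because the names are ordinary dictionary keys, no variables have to be created at runtime, and the final merge is a single fold over data_frames.values().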

Further Explanation:

attributes = ["name","skill"]

for c in attributes:
    c +'_df_name' = df.select(lit('xyz')) # simplified

On each iteration, the expression c +'_df_name' evaluates to a string: "name_df_name", then "skill_df_name". An assignment target must be a name, not an expression or a string literal, so the assignment within the loop is a syntax error.

You cannot do this:

"name_df_name" = 1

But you can do this:

name_df_name = 1
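The difference between these cases can be checked without running the assignments, because the error is raised when the code is compiled. The following is a small sketch using the built-in compile(); the helper name is_syntax_error is hypothetical, and the exact error message varies between Python versions.

```python
def is_syntax_error(source):
    """Hypothetical helper: True if `source` fails to compile with SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return True
    return False


# Assigning to an expression, as in the loop from the question.
expression_target = is_syntax_error("c + '_df_name' = 1")

# Assigning to a string literal.
literal_target = is_syntax_error('"name_df_name" = 1')

# Assigning to a plain name is valid syntax.
name_target = is_syntax_error("name_df_name = 1")

print(expression_target, literal_target, name_target)
```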

If you really want variables named after these strings, you could add them to the locals() dictionary at module level, but this is a dirty hack and is not recommended.

attributes = ["name","skill"]

for c in attributes:
    df_name = c +'_df_name'
    if df_name not in locals():
        locals()[df_name] = df.select(lit('xyz')) # simplified
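One reason the hack is fragile: it depends on where the code runs. As a minimal sketch (CPython behavior; the function name try_locals_hack is hypothetical), writing to locals() inside a function does not create a variable that the function body can read afterwards.

```python
def try_locals_hack():
    """Hypothetical demo: write to locals() in a function, then read the name."""
    locals()["name_df_name"] = 1  # does not define a local variable
    try:
        return name_df_name  # looked up as a global, which does not exist
    except NameError:
        return None


result = try_locals_hack()
print(result)
```

The dictionary approach behaves the same at module level and inside functions, which is another reason to prefer it.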