I have an array of columns. For each column in the array, I need to perform the same operation and store the result in a separate dataframe, so that these dataframes can later be merged/joined into one.
With the code below, I get the error: cannot assign to expression here. Maybe you meant '==' instead of '='?
I understand that the issue is that we are trying to derive the dataframe name dynamically with c +'_df_name' and then assign data to it.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("test").getOrCreate()
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
attributes = ["name","skill"]
for c in attributes:
    c +'_df_name' = df.select(lit('xyz')) # simplified
So I changed the above code as below:
for c in attributes:
    df_name = c +'_df_name'
    print(df_name) # prints name_df_name, skill_df_name
    df_name = df.select(lit('xyz')) # simplified
Now if I try to print the new dataframes using the code below, I get the error: name 'name_df_name' is not defined.
name_df_name.show()
skill_df_name.show()
How do we dynamically generate unique dataframe names for each assignment during the iteration?
You are trying to assign a dataframe to a string expression instead of storing it in a container such as a dictionary.
data_frames = {f"{c}_df_name": df.select(lit('xyz')) for c in attributes}
By using a dictionary, the dataframe names are accessible through the keys() method.
data_frames.keys()
The individual dataframes can be accessed using the appropriate key string, e.g.
data_frames["name_df_name"]
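Since the question mentions merging the results later, here is a minimal sketch of how a dictionary of results can be folded into one. To keep it runnable without a Spark session, it uses plain lists of row dictionaries; `build_frame`, `join_on_id` and the `id` key column are hypothetical stand-ins and are not part of the original code.

```python
from functools import reduce

attributes = ["name", "skill"]

def build_frame(column):
    # Hypothetical stand-in for df.select(...): a list of row dictionaries
    # that all share an "id" key (an assumption made for this sketch).
    return [{"id": i, column: f"{column}_{i}"} for i in range(3)]

# One entry per attribute, keyed by the generated name.
data_frames = {f"{c}_df_name": build_frame(c) for c in attributes}

def join_on_id(left, right):
    # Stand-in for an inner join on "id": merge rows that share the same id.
    right_by_id = {row["id"]: row for row in right}
    return [
        {**row, **right_by_id[row["id"]]}
        for row in left
        if row["id"] in right_by_id
    ]

# Fold all stored results into a single one.
merged = reduce(join_on_id, data_frames.values())
print(list(data_frames))
print(merged[0])
```

With real PySpark dataframes the same shape would probably be `reduce(lambda a, b: a.join(b, on="id"), data_frames.values())`, assuming every dataframe carries a common key column.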
Further Explanation:
attributes = ["name","skill"]
for c in attributes:
    c +'_df_name' = df.select(lit('xyz')) # simplified
On each iteration, the expression c +'_df_name' evaluates to a string: "name_df_name", then "skill_df_name". It's not possible to assign a value to a string (or to any other expression), so the assignment within the loop is a syntax error.
You cannot do this:
"name_df_name" = 1
But you can do this:
name_df_name = 1
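The difference between these cases can be checked without running the assignments, by asking Python to compile each statement. This is only an illustrative sketch; `is_valid_assignment` is a hypothetical helper name.

```python
def is_valid_assignment(source):
    # Try to compile the statement; invalid assignment targets
    # are rejected at compile time with a SyntaxError.
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

print(is_valid_assignment("c + '_df_name' = 1"))   # False: target is an expression
print(is_valid_assignment('"name_df_name" = 1'))   # False: target is a literal
print(is_valid_assignment("name_df_name = 1"))     # True: target is an identifier
```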
If you wish to use the generated strings as variable names, you could add them to the locals() dictionary, but this would be a dirty hack and is not recommended.
attributes = ["name","skill"]
for c in attributes:
    _ = c +'_df_name'
    if _ not in locals():
        locals()[_] = df.select(lit('xyz')) # simplified
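One reason the locals() approach is fragile: it only takes effect at module level, where locals() is the same dictionary as globals(). Inside a function, writing to it does not create a usable variable. The sketch below illustrates this with plain integers in place of dataframes; both function names are hypothetical.

```python
def try_locals_hack():
    # Writing into the mapping returned by locals() does not create
    # a real local variable inside a function.
    locals()["name_df_name"] = 1
    try:
        return name_df_name  # resolved as a global name, which does not exist
    except NameError:
        return "NameError"

def use_dictionary():
    # A plain dictionary behaves the same way in every scope.
    frames = {"name_df_name": 1}
    return frames["name_df_name"]

print(try_locals_hack())  # NameError
print(use_dictionary())   # 1
```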