I have a PySpark dataframe with the columns below:
Dataframe: httpClient
[capacity: string, version: string]
and I have a list of columns declared as
httpClient_fields = ["capacity", "`httpClient.install`", "date"]
I need to check whether the dataframe has the list items. If an item does not exist in the dataframe, I need to add it with empty values. So, as the result, I need:
Dataframe: httpClient
[capacity: string, version: string, `httpClient.install`: string, date: string]
This is my code now:
from pyspark.sql import functions as F

df_cols = httpClient.columns
for f in httpClient_fields:
    if f not in df_cols:
        httpClient = httpClient.withColumn(f, F.lit(''))
httpClient = httpClient.select(*httpClient_fields).dropDuplicates().repartition(1)
httpClient = httpClient.withColumnRenamed("httpClient.install", "httpClient_install")
When I execute this, I'm getting:
cannot resolve '`httpClient.install`'
Please let me know how to solve this.
Well, I'm not sure how to properly handle the dot ('.') there, since you seem to have used backticks already; in some cases that still doesn't work as expected because of parsing issues.
So, is it possible for you to replace the '.' with an underscore '_' inside the loop itself?
Something like this:
for f in httpClient_fields:
    if f not in df_cols:
        if '.' in f:
            f = f.replace('.', '_')  # replace dot with underscore
        httpClient = httpClient.withColumn(f, F.lit(''))
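If you want the underscore names end-to-end, you could pull the replacement out into a small helper that also strips the literal backtick characters stored inside the strings of httpClient_fields. This is just a sketch; clean_field is a hypothetical name, not part of your code:

```python
def clean_field(name):
    # Strip literal backtick characters and replace dots with
    # underscores, so the name can be used verbatim in
    # withColumn / select without any quoting concerns.
    return name.strip('`').replace('.', '_')

httpClient_fields = ["capacity", "`httpClient.install`", "date"]
cleaned = [clean_field(f) for f in httpClient_fields]
# cleaned == ["capacity", "httpClient_install", "date"]
```

You could then run your original loop and the final select over `cleaned` instead of `httpClient_fields`, and skip the withColumnRenamed call entirely.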
Well, the above might not be exactly what you are looking for. I also noticed something else: maybe you can try adding backticks in the last line as well.
Replace this:
httpClient = httpClient.withColumnRenamed("httpClient.install","httpClient_install")
with this (backticks added in the last line as well):
httpClient = httpClient.withColumnRenamed("`httpClient.install`", "httpClient_install")
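For what it's worth, I think the "cannot resolve" error comes from the backticks being part of the Python string itself: as far as I can tell, withColumn uses its name argument verbatim (backticks and all), while select parses backticks as identifier quoting and looks for a column without them, so the two calls never see the same name. A quick string-level illustration, assuming that reading of the API (no Spark needed to run it):

```python
field = "`httpClient.install`"

# withColumn(field, ...) names the new column with the raw string,
# so the backtick characters become part of the column name.
name_created = field

# select(field) strips the backtick quoting and tries to resolve
# the inner identifier instead.
name_resolved = field.strip('`')

print(name_created)   # `httpClient.install`
print(name_resolved)  # httpClient.install
```

That mismatch is why stripping the backticks (or replacing the dot, as above) before calling withColumn makes the later select resolve cleanly.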