How to select items inside a python list and add it to a dataframe


I have a PySpark DataFrame with the columns below:

Dataframe: httpClient
[capacity: string, version: string]

I also have a list of column names, declared as httpClient_fields = ["capacity", "`httpClient.install`", "date"]

I need to check whether the dataframe contains each item in the list. If an item does not exist in the dataframe, I need to add it as a column with empty values. So the result I need is:

Dataframe: httpClient
[capacity: string, version: string, `httpClient.install`: string, date: string]

This is my code now:

df_cols = httpClient.columns
for f in httpClient_fields:
    if f not in df_cols:
        httpClient= httpClient.withColumn(f, F.lit(''))
httpClient = httpClient.select(*httpClient_fields).dropDuplicates().repartition(1)
httpClient = httpClient.withColumnRenamed("httpClient.install","httpClient_install")

When I execute this, I get the error: cannot resolve '`httpClient.install`'

Please let me know how to solve this.


Solution

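  • I'm not sure exactly how the dot ('.') gets parsed here, since you already seem to be using backticks. As far as I can tell, withColumn treats its first argument as a literal column name, so the backticks become part of the new column's name, while select parses backticks as quoting and looks for a column named httpClient.install, which does not exist.

    So would it be possible for you to replace the '.' with an underscore '_' inside the loop itself?

    Something like this: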

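    from pyspark.sql import functions as F

    df_cols = httpClient.columns
    for f in httpClient_fields:
        # Strip the backticks and replace the dot so the new column gets a plain name.
        # Select these plain names afterwards instead of the original list entries.
        name = f.strip('`').replace('.', '_')
        if name not in df_cols:
            httpClient = httpClient.withColumn(name, F.lit(''))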
    

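    The above might not be what you are looking for. I also noticed that you could try adding backticks in the last line as well. Note that withColumnRenamed matches the existing name literally, so this only helps if the column's name actually contains the backticks.

    Replace this: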

    httpClient = httpClient.withColumnRenamed("httpClient.install","httpClient_install")
    
    

    with this (backticks added around the existing column name):

    httpClient = httpClient.withColumnRenamed("`httpClient.install`", "httpClient_install")
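
If you would rather keep the dotted name until the final rename, one option is to keep two forms of each name: the literal name (no backticks) for withColumn and for the check against df.columns, and the backtick-quoted form only for select. The following is a minimal sketch of that idea, not a tested fix for your exact job. The helper names literal_name, quoted_name and add_missing_columns are hypothetical, and it assumes that column names never contain backticks of their own.

```python
def literal_name(field: str) -> str:
    """Return the name as df.columns and withColumn see it (no surrounding backticks)."""
    if len(field) >= 2 and field.startswith("`") and field.endswith("`"):
        return field[1:-1]
    return field


def quoted_name(field: str) -> str:
    """Return the name in the form select() can parse: backtick-quoted if it has a dot."""
    name = literal_name(field)
    return f"`{name}`" if "." in name else name


def add_missing_columns(df, fields):
    """Add every missing field as an empty-string column, then select the fields in order.

    Hypothetical helper: df is assumed to be a pyspark.sql.DataFrame.
    """
    # Imported lazily so the pure-string helpers above work without pyspark installed.
    from pyspark.sql import functions as F

    existing = set(df.columns)
    for field in fields:
        name = literal_name(field)
        if name not in existing:
            # withColumn takes the name literally, so pass it without backticks.
            df = df.withColumn(name, F.lit(""))
    # select parses its string arguments, so dotted names need backticks here.
    return df.select(*[quoted_name(f) for f in fields])
```

With these helpers the call would be httpClient = add_missing_columns(httpClient, httpClient_fields). After that, the original withColumnRenamed("httpClient.install", "httpClient_install") line from the question should find the column, because the column was created without backticks in its name.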