pyspark azure-databricks

PySpark - Select dataframe.select if column exists


I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.

This all works fine until I get to the final call, because my statement is expecting a column (JSON value) that no longer exists because it's the end of the paginated collection.

How can I test for the existence of the field before I attempt a dataframe.select, so that a missing column doesn't fail my procedure?

Schema Example

root
 |-- d: struct (nullable = true)
 |    |-- __next: string (nullable = true)
 |    |-- results: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- __metadata: struct (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- uri: string (nullable = true)
 |    |    |    |-- assignmentClass: string (nullable = true)
 |    |    |    |-- assignmentIdExternal: string (nullable = true)
 |    |    |    |-- compInfoNav: struct (nullable = true)

My code

df = df.select(col('d.__next').alias("nexttoken"), explode(col('d.results')).alias("result"))

Essentially, at some point during the loop the __next value disappears, but the code above still tries to select it, so it can't find the column and errors.

Any help would be appreciated.


Solution

  • Since you want to check for the existence of the __next field before using DataFrame.select(), you can use the following code. This code works specifically for the schema that you have provided.

    from pyspark.sql.functions import col, explode

    # fieldNames() returns a list of strings naming the fields of the d struct,
    # e.g. ['__next', 'results'] for the schema above.
    d_fields = df.schema['d'].dataType.fieldNames()

    # Only run the select while d.__next is still present in the schema.
    if '__next' in d_fields:
        df = df.select(col('d.__next').alias("nexttoken"), explode(col('d.results')).alias("result"))

    df.schema['d'].dataType.fieldNames() returns a list of all the fields present in the d struct column, so you can use an if statement to check whether '__next' is in this list. At some point in the loop, when the d.__next field is no longer available, the condition evaluates to False, the select is skipped, and no error is thrown. Note that on that final page df is left unchanged, so d.results is not exploded either.