I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON, and now I am trying to access these dicts and create columns from the values in them. This is what the dicts look like:
The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema:
However, when I try:
dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
it throws this error:
PySpark: TypeError: col should be Column
There is no such problem with any of the other keys in the dict, e.g. "value".
I really do not understand the problem. Do I have to assume that there are inconsistencies in the data? If so, can you recommend a way to check for them, or even work around them?
Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:
import json
from pyspark.sql.types import StructType

schemapath = 'default_schema.json'
with open(schemapath) as f:
    d = json.load(f)

schemaNew = StructType.fromJson(d)
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)
However, the line
schemaNew = StructType.fromJson(d)
throws the following error:
KeyError: 'fields'
No idea where this 'fields' is coming from...
Errors in Spark tell the truth.
dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
TypeError: col should be Column
The DataFrame.withColumn documentation tells you what its input parameters are called and what their data types are:
Parameters:
- colName: str
  string, name of the new column.
- col: Column
  a Column expression for the new column.
So, col is the parameter's name and Column is its type. Column is the data type which withColumn expects to receive as the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what data type is that of?

print(type(dataframe.keywords_exp.name))
# <class 'method'>

As can be seen, it's not of the expected type Column. It is a bound method: the Column class itself has a method called name (an alias for alias), and attribute access resolves to that method before it ever looks at the struct's fields. That is also why the other keys, like "value", caused no trouble: they don't clash with any Column method.
To get a Column from a struct's field, you must use a different syntax.
Note: the data types in the dataframe are not what you think they are. You don't have dicts anymore; you have a struct-type column, and the keys of the old dictionaries are now the field names of that struct column.
To access struct fields, use either of the following options:
from pyspark.sql import functions as F

df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])
(Both F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
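If you prefer an explicit method call, Column.getField should work just as well:

df = dataframe.withColumn("keywords_name", dataframe.keywords_exp.getField("name"))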
Here is a dataframe with the same schema as yours. You can see that withColumn works well:
from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
[(("N", "glocations", 1, "Cuba"),)],
'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
# |-- keywords_exp: struct (nullable = true)
# | |-- major: string (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rank: long (nullable = true)
# | |-- value: string (nullable = true)
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df.show()
# +--------------------+-------------+
# | keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...| glocations|
# +--------------------+-------------+
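If you eventually want every field of the struct as its own top-level column, you can also expand the struct in one go:

df = dataframe.select("keywords_exp.*")

Regarding the KeyError: 'fields' from your edit: StructType.fromJson expects schema JSON, i.e. a dict with a top-level "fields" key such as the one DataFrame.schema.jsonValue() produces, not a raw data sample. A minimal sketch of that round trip, assuming you can read your default file once with an inferred schema (the file names and the default_df variable here are placeholders):

import json
from pyspark.sql.types import StructType

# Infer the schema once from a known-good response and dump it as schema JSON
default_df = spark.read.json("default_response.json", multiLine=True)
with open("default_schema.json", "w") as f:
    json.dump(default_df.schema.jsonValue(), f)

# Later: load the saved schema JSON and use it to parse new responses
with open("default_schema.json") as f:
    schemaNew = StructType.fromJson(json.load(f))
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)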