Can I change the nullability of a column in my Spark dataframe?


I have a StructField in a dataframe that is not nullable. Simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

which returns:

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted the code from it to this:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

which failed with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?


Solution

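  • createDataFrame needs a StructType as its schema; a plain list is treated as a list of column names, which is why passing the list of StructField objects fails. Wrap the list in StructType(newSchema). Also declare foo with nullable set to True, since that is the property you want to change.

    import pyspark.sql.functions as F
    from pyspark.sql.types import *
    l = [('Alice', 1)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
    # The third argument of StructField is nullable; foo is now declared nullable
    newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),True)]
    # Wrap the list of fields in a StructType before passing it as the schema
    df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
    df2.show()
    df2.schema.fields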