Tags: python, pyspark, mypy

How to solve mypy error "Value of type 'Row | None' is not indexable" for a PySpark DataFrame?


Mypy throws the error "Value of type 'Row | None' is not indexable" for the line starting with "x=":

from pyspark.sql import DataFrame
from pyspark.sql import functions as f

def somefunction(df: DataFrame, column_name: str) -> DataFrame:
    x = df.select(f.min(f.col(column_name))).first()[0]  # error: Value of type "Row | None" is not indexable
    return df.withColumn('newcolumn', f.col(column_name) + x)

How can I add a type check that passes mypy?


Solution

  • df.select(...).first() returns a value of type Row | None, which means the actual return value might be a Row, or it might be None. You can't index None, so you can't index a value of type Row | None until you establish that it definitely isn't None. (Essentially, the interface of a union type is the intersection of its members' interfaces: you can only do with A | B what you can do with both A and B.)

    One way to do that is to use type narrowing: by checking if the return value is None, you can branch into code where the static type is NoneType, or into code where the static type is Row.

    # reveal_type(result) == Row | None
    result = df.select(f.min(f.col(column_name))).first()
    
    if result is None:
        # reveal_type(result) == None
        raise ValueError("no row returned")  # or whatever handling fits your use case
    else:
        # reveal_type(result) == Row
        x = result[0]
        return df.withColumn('newcolumn', f.col(column_name) + x)
    

    You might know that this particular query always returns a row, so .first() cannot actually return None, but mypy does not.
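
    If you are confident the query always returns a row but still want a runtime check, an assert also narrows the type: after assert result is not None, mypy treats result as a Row. A minimal sketch of that idiom applied to the same query:

    result = df.select(f.min(f.col(column_name))).first()

    # mypy narrows result from Row | None to Row after this assertion;
    # at runtime it raises AssertionError if no row comes back
    assert result is not None
    x = result[0]
    return df.withColumn('newcolumn', f.col(column_name) + x)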


    Another option, if you are absolutely sure you will get a Row back, is to use cast to let mypy in on the secret.

    x = typing.cast(Row, df.select(...).first())[0]  # requires: import typing; from pyspark.sql import Row
    return df.withColumn(...)
    

    This is risky, though. mypy will believe the cast, and if .first() does return None after all, you'll get a runtime error even though mypy says it's OK.
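
    Putting the narrowing approach back into the original function might look like the sketch below; the ValueError is just an illustrative choice for the empty case, so substitute whatever handling your application needs:

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as f

    def somefunction(df: DataFrame, column_name: str) -> DataFrame:
        result = df.select(f.min(f.col(column_name))).first()
        if result is None:
            # illustrative handling of the empty case
            raise ValueError(f"no rows found for column {column_name}")
        # result is narrowed to Row here, so indexing passes mypy
        return df.withColumn('newcolumn', f.col(column_name) + result[0])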