Tags: python, pyspark, mypy

How to solve mypy error "Value of type 'Row | None' is not indexable" for a PySpark DataFrame?


Mypy throws the error "Value of type 'Row | None' is not indexable" for the line starting with "x=":

from pyspark.sql import DataFrame
from pyspark.sql import functions as f

def somefunction(df: DataFrame, column_name: str) -> DataFrame:
    x = df.select(f.min(f.col(column_name))).first()[0]  # error: Value of type "Row | None" is not indexable
    return df.withColumn('newcolumn', f.col(column_name) + x)

How can I add a type check that passes mypy?


Solution

  • df.select(...).first() returns a value of type Row | None, which means the actual return value might be a Row, or it might be None. You can't index None, so you can't index a value of type Row | None until you establish that it definitely isn't None. (Essentially, the interface of a union type is the intersection of its members' interfaces: you can only do with A | B what you can do with both A and B.)

    One way to do that is to use type narrowing: by checking if the return value is None, you can branch into code where the static type is NoneType, or into code where the static type is Row.

    # reveal_type(result) == Row | None
    result = df.select(f.min(f.col(column_name))).first()
    
    if result is None:
        # reveal_type(result) == None
        raise ValueError("no row returned")  # or whatever handling fits your use case
    else:
        # reveal_type(result) == Row
        x = result[0]
        return df.withColumn('newcolumn', f.col(column_name) + x)
    

    You might know that this particular query always returns a row, so .first() cannot actually return None, but mypy does not.
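
    If you are confident the query always returns a row but still want a runtime check, an assert also narrows the type: after assert result is not None, mypy treats result as a Row. A minimal sketch of that idiom applied to the same query:

    result = df.select(f.min(f.col(column_name))).first()

    # mypy narrows result from Row | None to Row after this assertion;
    # at runtime it raises AssertionError if no row comes back
    assert result is not None
    x = result[0]
    return df.withColumn('newcolumn', f.col(column_name) + x)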


    Another option, if you are absolutely sure you will get a Row back, is to use cast to let mypy in on the secret.

    x = typing.cast(Row, df.select(...).first())[0]  # requires: import typing; from pyspark.sql import Row
    return df.withColumn(...)
    

    This is risky, though. mypy will believe the cast, and if .first() does return None after all, you'll get a runtime error even though mypy says it's OK.
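
    Putting the narrowing approach back into the original function might look like the sketch below; the ValueError is just an illustrative choice for the empty case, so substitute whatever handling your application needs:

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as f

    def somefunction(df: DataFrame, column_name: str) -> DataFrame:
        result = df.select(f.min(f.col(column_name))).first()
        if result is None:
            # illustrative handling of the empty case
            raise ValueError(f"no rows found for column {column_name}")
        # result is narrowed to Row here, so indexing passes mypy
        return df.withColumn('newcolumn', f.col(column_name) + result[0])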