How to type hint a function that transforms an RDD?


Given a StructType schema, I want to be able to define

def foo(row: schema):
    return row.field

and have PyCharm recognize the fields of row, but PyCharm does not recognize 'schema' as a type. Inlining the schema makes no difference. (I'm using Python 3.8.)


Solution

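  • Define a dataclass whose fields mirror the schema and use it as the type hint. This is not technically correct, since at runtime row is a Row rather than a HintedRow, but it works just fine thanks to duck typing: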

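    from dataclasses import dataclass

    # Mirrors the schema's fields; used only as a type hint
    @dataclass
    class HintedRow:
        x: int
        y: str

    def foo(row: HintedRow):
        return row.x

    # df is an existing DataFrame whose schema has fields x and y
    df.rdd.map(foo)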
    

    You can also use HintedRow in unit tests, with no Spark session required, because foo only reads attributes that HintedRow and the Row have in common:

    test_row = HintedRow(x=1, y='bar')
    assert foo(test_row) == 1
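
To see why the duck typing holds, here is a minimal sketch that runs without Spark. FakeRow is a hypothetical stand-in introduced only for illustration: like pyspark's Row, it is a tuple subclass that exposes its fields as attributes. The sketch assumes nothing about Row beyond that attribute access.

```python
from collections import namedtuple
from dataclasses import dataclass


@dataclass
class HintedRow:
    """Type-hint helper whose fields mirror the schema (x, y)."""
    x: int
    y: str


# Hypothetical stand-in for pyspark.sql.Row, so the sketch runs without Spark
FakeRow = namedtuple("FakeRow", ["x", "y"])


def foo(row: HintedRow) -> int:
    # Only attribute access is used, so any object with an `x` attribute works
    return row.x


# Python does not enforce annotations at runtime, so both calls succeed.
# A static checker would flag the second call, since FakeRow is not a HintedRow.
from_dataclass = foo(HintedRow(x=1, y="bar"))
from_row_like = foo(FakeRow(x=1, y="bar"))
```

The annotation only guides the IDE and type checker; at runtime foo receives whatever rdd.map passes in, and the lookup of x succeeds as long as the field exists.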