python apache-spark dataframe apache-spark-sql rdd

Check Type: How to check if something is an RDD or a DataFrame?


I'm using Python (PySpark), and the object in question is a Spark RDD or DataFrame.

I tried isinstance(thing, RDD) but RDD wasn't recognized.

The reason I need to do this:

I'm writing a function that accepts either an RDD or a DataFrame, so I need to call input.rdd to get the underlying RDD when a DataFrame is passed in.


Solution

  • isinstance will work just fine:

    from pyspark.sql import DataFrame
    from pyspark.rdd import RDD
    
    def foo(x):
        if isinstance(x, RDD):
            return "RDD"
        if isinstance(x, DataFrame):
            return "DataFrame"
    
    foo(sc.parallelize([]))
    ## 'RDD'
    foo(sc.parallelize([("foo", 1)]).toDF())
    ## 'DataFrame'
    

    but single dispatch is a much more elegant approach:

    from functools import singledispatch
    
    @singledispatch
    def bar(x):
        pass 
    
    @bar.register(RDD)
    def _(arg):
        return "RDD"
    
    @bar.register(DataFrame)
    def _(arg):
        return "DataFrame"
    
    bar(sc.parallelize([]))
    ## 'RDD'
    
    bar(sc.parallelize([("foo", 1)]).toDF())
    ## 'DataFrame'
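
    One caveat with the version above: the base bar just passes, so an unsupported argument silently returns None. A minimal sketch of a stricter fallback follows. FakeRDD, FakeDataFrame and kind are hypothetical names used only so the snippet runs without a Spark session; in real code you would register pyspark.rdd.RDD and pyspark.sql.DataFrame instead.

```python
from functools import singledispatch


class FakeRDD:
    """Hypothetical stand-in for pyspark.rdd.RDD (assumption, not a Spark class)."""


class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame (assumption)."""


@singledispatch
def kind(x):
    # Fallback for any type without a registered handler: fail loudly
    # instead of silently returning None.
    raise TypeError(f"expected an RDD or a DataFrame, got {type(x).__name__}")


@kind.register(FakeRDD)
def _(x):
    return "RDD"


@kind.register(FakeDataFrame)
def _(x):
    return "DataFrame"
```

    With this in place, passing something like an int raises a TypeError at the call site rather than surfacing later as an unexpected None.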
    

    If you don't mind additional dependencies, multipledispatch is also an interesting option:

    from multipledispatch import dispatch
    
    @dispatch(RDD)
    def baz(x):
        return "RDD"
    
    @dispatch(DataFrame)
    def baz(x):
        return "DataFrame"
    
    baz(sc.parallelize([]))
    ## 'RDD'
    
    baz(sc.parallelize([("foo", 1)]).toDF())
    ## 'DataFrame'
    

    Finally, the most Pythonic approach is to simply check an interface:

    def foobar(x):
        if hasattr(x, "rdd"):
            ## It is a DataFrame
            return "DataFrame"
        else:
            ## It (probably) is an RDD
            return "RDD"
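
    Applied to the original goal (always ending up with an RDD), the interface check might look like the sketch below. to_rdd, FakeRDD and FakeDataFrame are hypothetical names introduced for illustration; the stand-in classes only mimic the one property that matters here, namely that a DataFrame exposes an rdd attribute and an RDD does not.

```python
def to_rdd(x):
    """Return the underlying RDD for a DataFrame, or x itself otherwise."""
    # DataFrames expose their underlying RDD as the `rdd` attribute;
    # anything without it is assumed (not verified) to already be an RDD.
    return x.rdd if hasattr(x, "rdd") else x


class FakeRDD:
    """Hypothetical stand-in for an RDD: has no `rdd` attribute."""


class FakeDataFrame:
    """Hypothetical stand-in for a DataFrame: wraps an RDD-like object."""

    def __init__(self, rdd):
        self.rdd = rdd
```

    The trade-off is the same as in foobar: any object without an rdd attribute is passed through unchanged, so a wrong input type only shows up when an RDD method is later called on it.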