pandas · dataframe · pyspark

How to specify different types of DataFrames in Python?


Let's say I have a PySpark DataFrame that I consider to be "Users". Then I have another one that I consider to be "Cars".

Now let's say that I have a function that returns a DataFrame of type "Cars".

Usually I see code like this:

def get_cars() -> DataFrame:
    pass

However "Dataframe" is not very expressive....is too generic. So, is it possible to specify something like this using alias or similar?:

def get_data() -> Cars: 
    pass

Solution

  • You could use type:

    builtins.type now supports subscripting ([]). See PEP 585 and Generic Alias Type.

    Source: [docs]

    import pandas as pd

    Cars = type[pd.DataFrame]
    
    def get_data() -> Cars:
        ...
        
    print(Cars) # type[pandas.core.frame.DataFrame]
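
    Since the question is about PySpark, the same alias idea works with pyspark.sql.DataFrame. A minimal sketch, assuming only that pyspark is installed (note that a plain alias annotates an instance of the DataFrame, whereas type[...] annotates the class itself):

    from pyspark.sql import DataFrame, SparkSession

    # Purely descriptive aliases: type checkers still see pyspark.sql.DataFrame,
    # but the names document the intent.
    Users = DataFrame
    Cars = DataFrame

    def get_cars(spark: SparkSession) -> Cars:
        # Hypothetical sample data, just to make the sketch runnable
        return spark.createDataFrame(
            [("Lambo", 2023), ("Porsche", 2000), ("Mustang", 2010)],
            ["Model", "Year"],
        )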
    

    From the comments:

    ... this type hint won't say anything about the columns that this Cars DataFrame contains.

    In this case, you may be tempted to use pandera (which also supports PySpark SQL):

    # pip install pandera
    import pandas as pd
    import pandera as pa
                              
    Cars = pa.DataFrameSchema({
        "Model": pa.Column(pa.String),
        "Year": pa.Column(pa.Int),
    })
    
    def get_cars() -> Cars:
        return pd.DataFrame({
            "Model": ["Lambo", "Porsche", "Mustang"],
            "Year": [2023, 2000, 2010],
        })
    

    Output:

    print(Cars.dtypes) # {'Model': DataType(str), 'Year': DataType(int64)}
    

    If you need to validate the schema, you can try this:

    df = get_cars()
    
    try:
        Cars.validate(df.astype({"Year": "float"}))
    except pa.errors.SchemaError as e:
        print(f"WRONG SCHEMA: {e}")
    
    # WRONG SCHEMA: expected series 'Year' to have type int64, got float64
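
    If you want the return annotation itself to carry the column information and be enforced when the function runs, pandera's class-based API can be combined with its typing module. A minimal sketch, assuming a recent pandera version that provides DataFrameModel, pandera.typing.DataFrame and the check_types decorator (the CarsModel name is just illustrative):

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series

    class CarsModel(pa.DataFrameModel):
        Model: Series[str]
        Year: Series[int]

    @pa.check_types  # validates the returned frame against CarsModel at call time
    def get_cars() -> DataFrame[CarsModel]:
        return pd.DataFrame({
            "Model": ["Lambo", "Porsche", "Mustang"],
            "Year": [2023, 2000, 2010],
        })

    For the PySpark case mentioned above, pandera also ships a pandera.pyspark module with an analogous DataFrameModel for pyspark.sql DataFrames (pandera >= 0.16).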