pandas · dataframe · pyspark

How to specify different types of DataFrames in Python?


Let's say I have a PySpark DataFrame that I consider to be "Users". Then I have another one that I consider to be "Cars".

Now let's say that I have a function that returns a DataFrame of type "Cars".

Usually I see code like this:

def get_cars() -> DataFrame:
    pass

However "Dataframe" is not very expressive....is too generic. So, is it possible to specify something like this using alias or similar?:

def get_data() -> Cars: 
    pass

Solution

  • You could use type:

    builtins.type now supports subscripting ([]). See PEP 585 and Generic Alias Type.

    Source: [docs]

    import pandas as pd

    Cars = type[pd.DataFrame]
    
    def get_data() -> Cars:
        ...
        
    print(Cars) # type[pandas.core.frame.DataFrame]
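
    Since the question is about PySpark, the same alias idea works with pyspark.sql.DataFrame. A minimal sketch, assuming only that pyspark is installed (note that a plain alias annotates an instance of the DataFrame, whereas type[...] annotates the class itself):

    from pyspark.sql import DataFrame, SparkSession

    # Purely descriptive aliases: type checkers still see pyspark.sql.DataFrame,
    # but the names document the intent.
    Users = DataFrame
    Cars = DataFrame

    def get_cars(spark: SparkSession) -> Cars:
        # Hypothetical sample data, just to make the sketch runnable
        return spark.createDataFrame(
            [("Lambo", 2023), ("Porsche", 2000), ("Mustang", 2010)],
            ["Model", "Year"],
        )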
    

    From the comments:

    ... this type hint won't say anything about the columns that this Cars DataFrame contains.

    In this case, you may be tempted to use pandera (which also supports PySpark SQL):

    # pip install pandera
    import pandas as pd
    import pandera as pa
                              
    Cars = pa.DataFrameSchema({
        "Model": pa.Column(pa.String),
        "Year": pa.Column(pa.Int),
    })
    
    def get_cars() -> Cars:
        return pd.DataFrame({
            "Model": ["Lambo", "Porsche", "Mustang"],
            "Year": [2023, 2000, 2010],
        })
    

    Output:

    print(Cars.dtypes) # {'Model': DataType(str), 'Year': DataType(int64)}
    

    If you need to validate the schema, you can try this:

    df = get_cars()
    
    try:
        Cars.validate(df.astype({"Year": "float"}))
    except pa.errors.SchemaError as e:
        print(f"WRONG SCHEMA: {e}")
    
    # WRONG SCHEMA: expected series 'Year' to have type int64, got float64
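
    If you want the return annotation itself to carry the column information and be enforced when the function runs, pandera's class-based API can be combined with its typing module. A minimal sketch, assuming a recent pandera version that provides DataFrameModel, pandera.typing.DataFrame and the check_types decorator (the CarsModel name is just illustrative):

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series

    class CarsModel(pa.DataFrameModel):
        Model: Series[str]
        Year: Series[int]

    @pa.check_types  # validates the returned frame against CarsModel at call time
    def get_cars() -> DataFrame[CarsModel]:
        return pd.DataFrame({
            "Model": ["Lambo", "Porsche", "Mustang"],
            "Year": [2023, 2000, 2010],
        })

    For the PySpark case mentioned above, pandera also ships a pandera.pyspark module with an analogous DataFrameModel for pyspark.sql DataFrames (pandera >= 0.16).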