Let's say I have a PySpark DataFrame which I consider to be "Users", and another one which I consider to be "Cars".
Now let's say I have a function which returns a DataFrame of type "Cars".
Usually I see code like this:
def get_cars() -> DataFrame:
    pass
However "Dataframe" is not very expressive....is too generic. So, is it possible to specify something like this using alias or similar?:
def get_data() -> Cars:
    pass
You could use type:
builtins.type now supports subscripting ([]). See PEP 585 and Generic Alias Type. (Source: [docs])
import pandas as pd

Cars = type[pd.DataFrame]

def get_data() -> Cars:
    ...

print(Cars)  # type[pandas.core.frame.DataFrame]
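Since the question is about PySpark, the same alias works with pyspark.sql.DataFrame; a minimal sketch, where only the import changes:

from pyspark.sql import DataFrame

Cars = type[DataFrame]

def get_data() -> Cars:
    ...

print(Cars)  # prints the generic alias, e.g. type[pyspark.sql.dataframe.DataFrame]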
From the comments:
... this type hint won't say anything about the columns that this Cars DataFrame contains.
In this case, you may be tempted to use pandera (which also supports PySpark SQL):
# pip install pandera
import pandera as pa

Cars = pa.DataFrameSchema({
    "Model": pa.Column(pa.String),
    "Year": pa.Column(pa.Int),
})
def get_cars() -> Cars:
    return pd.DataFrame({
        "Model": ["Lambo", "Porsche", "Mustang"],
        "Year": [2023, 2000, 2010],
    })
Output:
print(Cars.dtypes) # {'Model': DataType(str), 'Year': DataType(int64)}
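The annotation by itself is only documentation; nothing checks it at runtime. If you want the return value validated automatically, pandera ships decorators such as pa.check_output. A minimal sketch reusing the Cars schema above (the name get_cars_checked is just illustrative):

@pa.check_output(Cars)                   # validate the returned DataFrame against Cars
def get_cars_checked() -> pd.DataFrame:  # illustrative name, not from the original answer
    return pd.DataFrame({
        "Model": ["Lambo", "Porsche", "Mustang"],
        "Year": [2023, 2000, 2010],
    })

df = get_cars_checked()  # raises pa.errors.SchemaError if the output doesn't match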
If you need to validate the schema manually, you can try this:

df = get_cars()

try:
    Cars.validate(df.astype({"Year": "float"}))
except pa.errors.SchemaError as e:
    print(f"WRONG SCHEMA: {e}")
    # WRONG SCHEMA: expected series 'Year' to have type int64, got float64
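And since the question targets PySpark: recent pandera versions also ship a pyspark.sql integration (pandera.pyspark), where the schema is declared as a DataFrameModel annotated with pyspark.sql types. A minimal sketch assuming that API; exact details may differ between pandera versions:

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

class Cars(pa.DataFrameModel):
    Model: T.StringType() = pa.Field()
    Year: T.IntegerType() = pa.Field()

def get_cars() -> DataFrame:
    return spark.createDataFrame(
        [("Lambo", 2023), ("Porsche", 2010)],
        schema="Model string, Year int",
    )

validated = Cars.validate(get_cars())
# with the pyspark backend, validation errors are collected on the accessor rather than raised
print(validated.pandera.errors)  # empty when the schema matches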