I am looking at the Kedro library, as my team is considering using it for our data pipeline.
While going through the official tutorial (Spaceflights), I came across this function:
def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the data for companies.

    Args:
        companies: Source data.
    Returns:
        Preprocessed data.
    """
    companies["iata_approved"] = companies["iata_approved"].apply(_is_true)
    companies["company_rating"] = companies["company_rating"].apply(_parse_percentage)
    return companies
Looking at the function, my assumption is that (companies: pd.DataFrame) is shorthand for reading the "companies" dataset as a dataframe. If so, I do not understand what the -> pd.DataFrame at the end means.
I tried looking through the Python documentation for this style of code but did not manage to find anything.
Any help in understanding this would be much appreciated.
Thank you.
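The -> notation is a type hint, as is the : part in companies: pd.DataFrame in the function definition. The hint after : describes the type expected for the parameter, and the hint after -> describes the type of the value the function returns. Type hints are optional in Python and are not enforced at runtime, but many people like to include them. The function would work exactly the same if it omitted them and instead read: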
def preprocess_companies(companies):
This is a general Python thing rather than anything kedro-specific.
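To make the point about type hints concrete, here is a minimal sketch in plain Python. The functions double_annotated and double_plain are made up for illustration and have nothing to do with kedro; they only show that annotations are stored as metadata and do not change how the function runs.

```python
from typing import get_type_hints


def double_annotated(values: list) -> list:
    """Return a new list with every element doubled (with type hints)."""
    return [v * 2 for v in values]


def double_plain(values):
    """Same logic as above, written without any type hints."""
    return [v * 2 for v in values]


# Both versions behave identically at runtime.
same_result = double_annotated([1, 2]) == double_plain([1, 2])

# The hints are only recorded as metadata on the function object.
hints = get_type_hints(double_annotated)

# Python does not enforce the hints: a tuple is accepted even though
# the parameter is annotated as a list.
result_from_tuple = double_annotated((1, 2))
```

Static checkers such as mypy and most editors read these hints to flag mismatches before the code runs, which is the main reason people include them.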
The way that kedro registers companies as a kedro dataset is completely separate from this function definition and is done through the catalog.yml file:
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
There will then be a node defined (in pipeline.py) to specify that the preprocess_companies function should take the kedro dataset companies as input:
node(
    func=preprocess_companies,
    inputs="companies",  # THIS LINE REFERS TO THE DATASET NAME
    outputs="preprocessed_companies",
    name="preprocessing_companies",
),
In theory the name of the parameter in the function itself could be completely different, e.g.
def preprocess_companies(anything_you_want):
... although it is very common to give it the same name as the dataset.
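The separation between dataset names and parameter names can be sketched with a toy stand-in written in plain Python. This is not kedro's actual implementation or API: toy_catalog and run_toy_node are hypothetical names invented here, and the sample row and the "t" flag are made-up data. The sketch only shows the idea that the data is looked up by its dataset name and then handed to the function, whatever the parameter happens to be called.

```python
# Toy stand-in for a data catalog: dataset name -> already-loaded data.
# (Hypothetical; the real catalog is configured through catalog.yml.)
toy_catalog = {"companies": [{"id": 1, "iata_approved": "t"}]}


def preprocess_companies(anything_you_want):
    """The parameter name is arbitrary; it never has to match the dataset name."""
    return [
        {**row, "iata_approved": row["iata_approved"] == "t"}
        for row in anything_you_want
    ]


def run_toy_node(func, inputs, outputs, catalog):
    """Look up the input dataset by name, call func on it, store the result by name."""
    catalog[outputs] = func(catalog[inputs])


run_toy_node(
    func=preprocess_companies,
    inputs="companies",  # refers to the dataset name, not the parameter name
    outputs="preprocessed_companies",
    catalog=toy_catalog,
)
```

After this runs, toy_catalog contains a "preprocessed_companies" entry even though the function never mentions the name "companies" anywhere.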