What does this Python function signature mean in the Kedro tutorial?


I am looking at the Kedro library, as my team is considering using it for our data pipeline.

While going through the official Spaceflights tutorial, I came across this function:

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the data for companies.

    Args:
        companies: Source data.
    Returns:
        Preprocessed data.

    """
    companies["iata_approved"] = companies["iata_approved"].apply(_is_true)

    companies["company_rating"] = companies["company_rating"].apply(_parse_percentage)

    return companies

  • companies is the name of the CSV file containing the data

Looking at the function, my assumption is that (companies: pd.DataFrame) is shorthand for reading the "companies" dataset as a DataFrame. If so, I do not understand what -> pd.DataFrame at the end means.

I tried looking through the Python documentation for this style of code, but I did not manage to find anything.

Any help in understanding this would be much appreciated.

Thank you


Solution

  • The -> pd.DataFrame notation is a type hint for the return value, and the : pd.DataFrame part after companies is a type hint for the parameter. Type hints are optional in Python and are not enforced when the code runs, but many people like to include them for readability and for tools such as editors and type checkers. The function would behave exactly the same if the definition instead read:

    def preprocess_companies(companies):
    

    This is a general Python feature rather than anything Kedro-specific.
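To make the point about type hints concrete, here is a minimal sketch. It uses plain built-in types instead of pandas so that it runs without any dependencies, and the function names double and double_untyped are made up purely for illustration; they are not part of the Kedro tutorial.

```python
def double(value: int) -> int:
    """Return the argument added to itself (annotated version)."""
    return value + value


def double_untyped(value):
    """Same behaviour, written without any annotations."""
    return value + value


# The hints are stored as metadata on the function object.
hints = double.__annotations__

# Python does not check the hints when the function is called:
# passing a str still works even though the hint says int.
result_int = double(21)
result_str = double("ab")
```

Both versions behave identically at runtime. The hints only matter to readers and to external tools such as a type checker or an IDE, which would flag the double("ab") call.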

    The way that Kedro registers companies as a dataset is completely separate from this function definition and is done through the catalog.yml file:

    companies:
      type: pandas.CSVDataSet
      filepath: data/01_raw/companies.csv
    

    There will then be a node defined (in pipeline.py) which specifies that the preprocess_companies function should take the Kedro dataset companies as its input:

    node(
        func=preprocess_companies,
        inputs="companies",  # THIS LINE REFERS TO THE DATASET NAME
        outputs="preprocessed_companies",
        name="preprocessing_companies",
    ),
    

    In theory the name of the parameter in the function itself could be completely different, e.g.

    def preprocess_companies(anything_you_want):
    

    ... although it is very common to give it the same name as the dataset.
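The separation between dataset names and parameter names can be sketched without Kedro at all. The snippet below is a deliberately simplified toy, not Kedro's actual implementation: the catalog dict and the run_node helper are hypothetical stand-ins for the data catalog and the runner, and plain lists of dicts stand in for DataFrames.

```python
# Hypothetical stand-in for a data catalog: dataset name -> loaded data.
catalog = {"companies": [{"id": 1, "company_rating": "90%"}]}


def preprocess_companies(anything_you_want):
    """The parameter name is arbitrary; the data is passed in by position."""
    return [
        {**row, "company_rating": float(row["company_rating"].rstrip("%")) / 100}
        for row in anything_you_want
    ]


def run_node(func, inputs, outputs, catalog):
    """Toy runner: look up the input by dataset name, call the function
    with it positionally, and store the result under the output name."""
    data = catalog[inputs]
    catalog[outputs] = func(data)


run_node(
    func=preprocess_companies,
    inputs="companies",  # refers to the dataset name, not the parameter name
    outputs="preprocessed_companies",
    catalog=catalog,
)
```

In this sketch the string "companies" is only ever used as a lookup key, which is why the function parameter can be called anything.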