python pandas azure azure-databricks azure-machine-learning-service

Error when removing duplicates in Python for an AzureML classification problem


I'm getting this error when calling the drop_duplicates function:

Traceback (most recent call last):
  File "train.py", line 159, in <module>
    orders_dfx = preprocess_orders(orders_df)
  File "train.py", line 20, in preprocess_orders
    ao = ao.drop_duplicates(subset=['order_id'], keep='last')
AttributeError: 'TabularDataset' object has no attribute 'drop_duplicates'

Here is part of the train.py code:

def preprocess_orders(ao):
  ao = ao.drop_duplicates(subset=['order_id'], keep='last')
  ao['order_id'] = ao['order_id'].astype('str')
  ao['class'] = ao['class'].astype('int')
  ao['age'] = ao['age'].astype('float').fillna(ao['age'].mean()).round(2)
  return ao

orders_df = Dataset.get_by_name(ws, name='class_cancelled_orders')
orders_df.to_pandas_dataframe()
# Doing processing
orders_dfx = preprocess_orders(orders_df)

I'm getting the data from the datasets in Azure ML studio. The job.py file is used to run the experiment:

# submit job
run = Experiment(ws, experiment_name).submit(src)
run.wait_for_completion(show_output=True)

Solution

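  • Dataset.get_by_name returns a TabularDataset, which has no drop_duplicates method. The to_pandas_dataframe() method does not convert the dataset in place; it returns a new pandas DataFrame, so you need to assign the result back to your variable: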

    orders_df = orders_df.to_pandas_dataframe()
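
For reference, here is a minimal sketch of how the corrected flow might look. It assumes the dataset has the order_id, class and age columns used in the question. The small hand-made DataFrame and its values are hypothetical and only stand in for the result of to_pandas_dataframe(), so the snippet runs without an Azure ML workspace. The added .copy() call is my own suggestion rather than part of the original code.

```python
import pandas as pd


def preprocess_orders(ao: pd.DataFrame) -> pd.DataFrame:
    # Keep only the last row for each order_id. The copy() is an optional
    # addition so the column assignments below act on an independent frame.
    ao = ao.drop_duplicates(subset=['order_id'], keep='last').copy()
    ao['order_id'] = ao['order_id'].astype('str')
    ao['class'] = ao['class'].astype('int')
    # Fill missing ages with the mean of the de-duplicated rows, then round.
    age = ao['age'].astype('float')
    ao['age'] = age.fillna(age.mean()).round(2)
    return ao


# In train.py the DataFrame would come from the registered dataset:
#   orders_ds = Dataset.get_by_name(ws, name='class_cancelled_orders')
#   orders_df = orders_ds.to_pandas_dataframe()   # assign the returned DataFrame
# Hypothetical stand-in data so the example is self-contained:
orders_df = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'class': [0, 1, 0, 1],
    'age': [30.0, 31.0, None, 40.0],
})

orders_dfx = preprocess_orders(orders_df)
print(orders_dfx)
```

Keeping the TabularDataset and the DataFrame under different names (orders_ds and orders_df here) makes it harder to pass the wrong object to preprocess_orders by accident.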