How to manipulate prediction response from Google AutoML in a Pandas DataFrame?


I have successfully trained a Google AutoML Natural Language model to do multi-label categorization of text using custom labels.

I'm also able to use the Python function generated by the trained dataset to generate predictions on text contained in a Pandas DataFrame in a Jupyter Notebook.

However, I'm not sure how to use the result, and especially how to manipulate it so that it's useful to me.

Here's what my code looks like currently:

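import pandas as pd
from io import StringIO
r = ...  # API call to get text
df = pd.read_csv(StringIO(r.text), usecols=['review'])
# axis=1 makes apply pass each row (not each column) to the lambda
df['Category_Predicted'] = df.apply(lambda row: get_prediction(row.review, 'xxx', 'xxxx'), axis=1)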

The output of df['Category_Predicted'].head() is

0    payload {\n  classification {\n    score: 0.61...
Name: Category_Predicted, dtype: object

And a simple (more readable) print of one prediction returns

payload {
  classification {
    score: 0.6122230887413025
  }
  display_name: "Shopping"
}
payload {
  classification {
    score: 0.608892023563385
  }
  display_name: "Search"
}
payload {
  classification {
    score: 0.38840705156326294
  }
  display_name: "Usability"
}
payload {
  classification {
    score: 0.2736874222755432
  }
  display_name: "Stability"
}
payload {
  classification {
    score: 0.011237740516662598
  }
  display_name: "Profile"
}
#......................(continues on for all categories)

My primary objective is for df['Category_Predicted'] to hold the topmost (most relevant) categories as a simple comma-separated string. For the example above, that would be

Shopping, Search, Usability

(depending on how far down you want to keep labels based on the score)

So I have several questions on my hands:

  • How can I access this field with Python to get each category and its related score?

  • How to manipulate it to create a single string?

Thanks!

EDIT

As requested in the comments, below are examples of 2 records in my DataFrame with (incomplete) payloads. In the desired result I have kept only the labels with score > 0.3. Because the text fields are large, I had to use a... "custom" presentation instead of ASCII tables.

ROW 1 - TEXT TO PREDICT

Great app so far. Just a pity that you can not look in the old app what you still had in your shopping or what your favorites were. This fact is simply gone. Plus that you now have to enter everything in the new one !!!

ROW 1 - PREDICTION OUTPUT

payload {
  classification {
    score: 0.6122230887413025
  }
  display_name: "Shopping"
}
payload {
  classification {
    score: 0.608892023563385
  }
  display_name: "Search"
}
payload {
  classification {
    score: 0.38840705156326294
  }
  display_name: "Usability"
}
payload {
  classification {
    score: 0.2736874222755432
  }
  display_name: "Stability"
}

ROW 1 - DESIRED OUTPUT

Shopping, Search, Usability

ROW 2 - TEXT TO PREDICT

2nd time you make us the joke of a new app worse than the 1st. How long before raising the level with this one? Not intuitive at all, not so clear ... In short not at the level of the previous one

ROW 2 - PREDICTION OUTPUT

payload {
  classification {
    score: 0.9011210203170776
  }
  display_name: "Usability"
}
payload {
  classification {
    score: 0.8007309436798096
  }
  display_name: "Shopping"
}
payload {
  classification {
    score: 0.5114057660102844
  }
  display_name: "Stability"
}
payload {
  classification {
    score: 0.226901113986969
  }
  display_name: "Search"
}

ROW 2 - DESIRED OUTPUT

Usability, Shopping, Stability


Solution

  • I know it's bad to answer my own question, but I figured that if somebody runs into the same problem, they might find a solution here.

    As google.cloud.automl_v1beta1 defines it, the return value of get_prediction is an object of type PredictResponse (https://cloud.google.com/natural-language/automl/docs/reference/rpc/google.cloud.automl.v1beta1#predictresponse).
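Going by that reference and the printed output in the question, each entry of the response's payload exposes a display_name and a classification.score. Here is a minimal sketch of reading those two fields. It uses SimpleNamespace stand-ins (make_payload is a hypothetical helper, and the scores are copied from the question) rather than a real API response, so it runs without credentials:

```python
from types import SimpleNamespace


def make_payload(name, score):
    # Hypothetical stand-in mimicking one entry of PredictResponse.payload.
    return SimpleNamespace(
        display_name=name,
        classification=SimpleNamespace(score=score),
    )


# Stand-in for the object returned by get_prediction(...).
pred = SimpleNamespace(payload=[
    make_payload("Shopping", 0.6122230887413025),
    make_payload("Search", 0.608892023563385),
    make_payload("Usability", 0.38840705156326294),
    make_payload("Stability", 0.2736874222755432),
])

# Each entry exposes the label and its score as plain attributes.
label_scores = [(p.display_name, p.classification.score) for p in pred.payload]
for name, score in label_scores:
    print(f"{name}: {score:.3f}")
```

With a real response, the same attribute access should apply to pred.payload directly.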

    Using the documentation and the structure of that object, I found that this code does the trick:

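    for index, row in df.iterrows():
        pred = get_prediction(row['review'], GCP_PROJ, AUTOML_DS)
        # filterPrediction is a predicate defined elsewhere (not shown here)
        filteredCategories = filter(filterPrediction, pred.payload)
        # Join the remaining label names into one comma-separated string
        df.at[index, 'Predicted_Categories'] = ", ".join(categ.display_name for categ in filteredCategories)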
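
The filterPrediction helper is not shown above. Here is a minimal sketch of what it might look like, assuming a score cutoff of 0.3 (the value mentioned in the question's edit). The names SCORE_THRESHOLD, labels_above and fake are illustrative assumptions, not part of the AutoML client, and the payload entries are stand-ins built from the ROW 2 example:

```python
from types import SimpleNamespace

SCORE_THRESHOLD = 0.3  # assumed cutoff, taken from the question's edit


def filterPrediction(annotation, threshold=SCORE_THRESHOLD):
    # Keep an annotation only if its classification score exceeds the cutoff.
    return annotation.classification.score > threshold


def labels_above(payload, threshold=SCORE_THRESHOLD):
    # Filter first, then sort by descending score so the most relevant
    # label comes first, and join the display names into one string.
    kept = [a for a in payload if filterPrediction(a, threshold)]
    kept.sort(key=lambda a: a.classification.score, reverse=True)
    return ", ".join(a.display_name for a in kept)


def fake(name, score):
    # Hypothetical stand-in for one entry of PredictResponse.payload.
    return SimpleNamespace(
        display_name=name,
        classification=SimpleNamespace(score=score),
    )


row2_payload = [
    fake("Usability", 0.9011210203170776),
    fake("Shopping", 0.8007309436798096),
    fake("Stability", 0.5114057660102844),
    fake("Search", 0.226901113986969),
]

result = labels_above(row2_payload)
print(result)  # Usability, Shopping, Stability
```

With the real client, labels_above(pred.payload) could replace the filter and join inside the loop. The explicit sort is only a safeguard in case the service does not return labels ordered by score.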