I have successfully trained a Google AutoML Natural Language model to do multi-label categorization of text using custom labels.
I'm also able to use the Python function generated for the trained model to get predictions for text held in a Pandas DataFrame in a Jupyter Notebook.
However, I'm not sure how to use the result, and especially how to manipulate it into something useful to me.
Here's what my code looks like currently:
r = #api call to get text
df = pd.read_csv(StringIO(r.text), usecols=['text_to_predict'])
# axis=1 passes each row (not each column) to the lambda
df['Category_Predicted'] = df.apply(lambda row: get_prediction(row.review, 'xxx', 'xxxx'), axis=1)
The output of df['Category_Predicted'].head() is
0 payload {\n classification {\n score: 0.61...
Name: Category_Predicted, dtype: object
And a simple (more readable) print of one prediction returns
payload {
classification {
score: 0.6122230887413025
}
display_name: "Shopping"
}
payload {
classification {
score: 0.608892023563385
}
display_name: "Search"
}
payload {
classification {
score: 0.38840705156326294
}
display_name: "Usability"
}
payload {
classification {
score: 0.2736874222755432
}
display_name: "Stability"
}
payload {
classification {
score: 0.011237740516662598
}
display_name: "Profile"
}
#......................(continues on for all categories)
Now, my primary objective is for df['Category_Predicted'] to hold the topmost (most relevant) categories as a simple comma-separated list. For the example above, that would be:
Shopping, Search, Usability
(depending on how far down the scores you want to keep labels)
So I have a couple of questions:
How do I access this field with Python to get each category and its related score?
How do I manipulate it to create a single string?
Thanks!
EDIT
As requested in the comments, below are examples of 2 records from my DataFrame with their (incomplete) payloads; in the desired result I have kept only the labels with score > 0.3. Because the text fields are large, I had to use a "custom" layout for presentation instead of ASCII tables.
ROW 1 - TEXT TO PREDICT
Great app so far. Just a pity that you can not look in the old app what you still had in your shopping or what your favorites were. This fact is simply gone. Plus that you now have to enter everything in the new one !!!
ROW 1 - PREDICTION OUTPUT
payload {
classification {
score: 0.6122230887413025
}
display_name: "Shopping"
}
payload {
classification {
score: 0.608892023563385
}
display_name: "Search"
}
payload {
classification {
score: 0.38840705156326294
}
display_name: "Usability"
}
payload {
classification {
score: 0.2736874222755432
}
display_name: "Stability"
}
ROW 1 - DESIRED OUTPUT
Shopping, Search, Usability
ROW 2 - TEXT TO PREDICT
2nd time you make us the joke of a new app worse than the 1st. How long before raising the level with this one? Not intuitive at all, not so clear ... In short not at the level of the previous one
ROW 2 - PREDICTION OUTPUT
payload {
classification {
score: 0.9011210203170776
}
display_name: "Usability"
}
payload {
classification {
score: 0.8007309436798096
}
display_name: "Shopping"
}
payload {
classification {
score: 0.5114057660102844
}
display_name: "Stability"
}
payload {
classification {
score: 0.226901113986969
}
display_name: "Search"
}
ROW 2 - DESIRED OUTPUT
Usability, Shopping, Stability
I know it's bad to answer my own question, but I figured that if somebody runs into the same problem, they might find a solution here.
As google.cloud.automl_v1beta1 defines it, the return value of the get_prediction method is an object of type PredictResponse ( https://cloud.google.com/natural-language/automl/docs/reference/rpc/google.cloud.automl.v1beta1#predictresponse ).
Using the documentation and the structure of that object, I found that this code does the trick:
for index, row in df.iterrows():
    pred = get_prediction(row['review'], GCP_PROJ, AUTOML_DS)
    # filterPrediction is a predicate (not shown here) that returns True for the payload entries to keep
    filteredCategories = filter(filterPrediction, pred.payload)
    df.at[index, 'Predicted_Categories'] = ",".join([str(categ.display_name) for categ in filteredCategories])
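Since filterPrediction is not shown above, here is a minimal sketch of what such a predicate and the joining step could look like. It assumes each payload entry exposes display_name and classification.score, as the printed output in the question suggests, and it uses the score > 0.3 cutoff from the question's examples. The names filter_prediction, top_categories, fake_entry and SCORE_THRESHOLD are made up for illustration, and fake_entry only imitates a payload entry so the snippet runs without calling the API.

```python
from types import SimpleNamespace

# Assumed cutoff, taken from the "score > 0.3" example in the question.
SCORE_THRESHOLD = 0.3


def filter_prediction(entry, threshold=SCORE_THRESHOLD):
    """Return True when the entry's classification score is above the threshold."""
    return entry.classification.score > threshold


def top_categories(payload, threshold=SCORE_THRESHOLD, sep=", "):
    """Join the display names of the entries above the threshold, highest score first."""
    kept = [entry for entry in payload if filter_prediction(entry, threshold)]
    kept.sort(key=lambda entry: entry.classification.score, reverse=True)
    return sep.join(entry.display_name for entry in kept)


def fake_entry(name, score):
    """Hypothetical stand-in that mimics one payload entry of a PredictResponse."""
    return SimpleNamespace(
        display_name=name,
        classification=SimpleNamespace(score=score),
    )


# Scores copied from ROW 1 and ROW 2 in the question.
row1_payload = [
    fake_entry("Shopping", 0.6122230887413025),
    fake_entry("Search", 0.608892023563385),
    fake_entry("Usability", 0.38840705156326294),
    fake_entry("Stability", 0.2736874222755432),
]
row2_payload = [
    fake_entry("Usability", 0.9011210203170776),
    fake_entry("Shopping", 0.8007309436798096),
    fake_entry("Stability", 0.5114057660102844),
    fake_entry("Search", 0.226901113986969),
]

print(top_categories(row1_payload))
print(top_categories(row2_payload))
```

With a real response, top_categories(pred.payload) should behave the same way as long as the entries carry those two fields. The explicit sort is only a safeguard in case the payload does not arrive ordered by score.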