ibm-cloud ibm-watson personality-insights

Watson special characters, repeat posts, and URL handling

In using the Watson Personality Insights API, I've already noted some odd trends, including many results scored at a mean value across dimensions (e.g. agreeableness, with many around .27), making me think it's imputing to something.

Upon review I've noted a language misalignment issue (i.e. if it thinks the text is English, you could get weird results if it's actually, say, Spanish), which has led me to ask, but not find an answer to:

How does Watson handle:

1) URLs in the message (e.g. many Twitter posts have URLs)
2) repeat posts (many channels repeat-post things many times)
3) special characters (many posts have a ton of random special characters)

My goal is to determine how much pre-processing I need to do to make Watson most effective.

Solution

You are correct that if the language is misaligned, you will get incorrect results.

The PI API determines the language first from the Content-Language header. If that header is missing and the content type is JSON, it looks at the language in the JSON content, selecting the language that has the highest number of occurrences. Finally, if that is also missing, it falls back to the default language, namely English.

So in short, the recommendation (which will become required in a future update) is to always send the Content-Language header.
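As a rough illustration of that recommendation, here is a minimal Python sketch that builds a request with the language declared explicitly. The endpoint URL is a placeholder, the helper name `build_profile_request` is made up for this example, and authentication is left out entirely, so treat it as an outline rather than a working client.

```python
import urllib.request

# Placeholder endpoint (assumption): substitute the URL from your own service credentials.
PROFILE_URL = "https://example.invalid/personality-insights/api/v3/profile"


def build_profile_request(text, language, url=PROFILE_URL):
    """Build (but do not send) a POST request that declares the text's language."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain;charset=utf-8",
            # Declaring the language up front avoids the fallback to English.
            "Content-Language": language,
            "Accept": "application/json",
        },
        method="POST",
    )


# Spanish text is tagged as "es" instead of relying on the default.
req = build_profile_request("Hola, ¿qué tal?", "es")
```

The request object is only constructed here; sending it would additionally need your credentials and a real service URL.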

Secondly, to your question on the content:

- URLs: the service will attempt to remove these. I won't guarantee that it removes every possible form, as the URL spec has some very esoteric options, but we will remove the common formats.
- Repeat posts: if you send in the same post twice, then it will be counted twice. We do no de-duplication of the text that is sent into the service.
- Special characters: I'm assuming you are referring to emojis here. These are included in our processing, as the underlying models were trained on data that included them as well, and thus they are one of the many signals the service uses.
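Given the points above, the main pre-processing worth doing on the client side is de-duplication, with URL stripping as an optional safety net. The sketch below is one possible way to do that in Python. The function name `preprocess_posts` and the URL pattern are assumptions made for illustration, and the pattern is not the rule the service itself applies. Emojis are deliberately left untouched, since the service uses them as a signal.

```python
import re

# Simple pattern for http(s) URLs (assumption: not the service's own matching rule).
URL_RE = re.compile(r"https?://\S+")


def preprocess_posts(posts, strip_urls=True):
    """Drop repeated posts (keeping the first occurrence) and optionally strip URLs."""
    seen = set()
    cleaned = []
    for post in posts:
        text = URL_RE.sub("", post) if strip_urls else post
        text = " ".join(text.split())  # collapse runs of whitespace
        key = text.casefold()  # compare case-insensitively
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


posts = [
    "Big news! http://t.co/abc",
    "Big news! http://t.co/xyz",
    "big news!",
    "Hola 😀",
]
unique_posts = preprocess_posts(posts)
```

Whether near-duplicates that differ only by a shortened link or by letter case should count as repeats is a judgment call; pass `strip_urls=False` to treat posts with different links as distinct.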