google-cloud-platform  text-classification  google-natural-language

Google Cloud Natural Language API: Classifying Plain Text vs. HTML


I want to use the Google Cloud Natural Language API to classify query results (see "Classifying content" in the documentation).

The query results I want to classify are available both as HTML and as plain text. The official documentation says that the API accepts both document types, Document.Type.PLAIN_TEXT and Document.Type.HTML.

Because the HTML version carries additional markup, e.g. <b>important text</b>, I am wondering which format gives the best classification result.


Solution

  • (Not sure whether this response is still useful.) HTML pages often have a lot of unimportant material around the main content, such as ads, and that material can easily skew the classification. The HTML handling in the API basically tries to prune those sections and work only with the main piece. If your HTML needs that kind of cleanup, it is better to send it with the HTML type when calling the API. If your plain text already contains only the main content, the difference should matter less.
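
To make the choice concrete, here is a minimal Python sketch of how the document type is selected when calling the classification endpoint. It is an illustration under assumptions, not a definitive implementation: the helper names `build_classify_request` and `classify_with_client` and the `is_html` flag are made up for this example, and the second helper assumes the `google-cloud-language` package and application credentials are available. The first helper only builds the JSON body for the REST method `documents:classifyText`, so it runs without any dependencies.

```python
import json

# REST endpoint for content classification (v1 of the API).
CLASSIFY_URL = "https://language.googleapis.com/v1/documents:classifyText"


def build_classify_request(content: str, is_html: bool) -> dict:
    """Build the JSON body for documents:classifyText.

    `is_html` is a hypothetical flag used only in this sketch; the API
    itself looks at the `type` field ("HTML" or "PLAIN_TEXT").
    """
    return {
        "document": {
            "type": "HTML" if is_html else "PLAIN_TEXT",
            "content": content,
        }
    }


def classify_with_client(content: str, is_html: bool):
    """Classify a document with the google-cloud-language client.

    The import is done lazily so the rest of the sketch runs even when
    the package or credentials are not installed.
    """
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    doc_type = (
        language_v1.Document.Type.HTML
        if is_html
        else language_v1.Document.Type.PLAIN_TEXT
    )
    document = language_v1.Document(content=content, type_=doc_type)
    response = client.classify_text(request={"document": document})
    # Each category carries a name and a confidence score.
    return [(c.name, c.confidence) for c in response.categories]


if __name__ == "__main__":
    page = "<html><body><div>Ad</div><article><b>important text</b></article></body></html>"
    # Show the request body that would be POSTed to CLASSIFY_URL.
    print(json.dumps(build_classify_request(page, is_html=True), indent=2))
```

One way to answer the original question empirically would be to run `classify_with_client` on the same query result twice, once as HTML and once as plain text, and compare the returned categories and confidence scores.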