python-3.x, google-cloud-platform, cloud-document-ai

Does Google Document AI support the Thai language?


I am trying to extract the text and key-value pairs from a Thai-language document using Google Document AI. In the result, the Thai text is not retained: the output contains only English (Latin) characters. Is there a parameter that needs to be passed so that Thai characters are recognised?

The link below indicates that Document AI supports Thai as well: https://cloud.google.com/document-ai/docs/languages


Solution

  • The Supported Language Documentation specifically refers to languages supported by Optical Character Recognition.

    Individual processors may support only a subset of those languages. Since you mention "key-value pairs", it sounds like you're using the Form Parser, which, according to the Processor page, supports Latin-script languages only. That does not include Thai.

    The documentation could be clearer about language support for individual processors; work is currently under way to address this.

    Update 1: The Supported Languages Documentation has been updated to make this more explicit.

    The Processor List page also shows language support for each processor type.

    Update 2: The newest version of the Form Parser processor, pretrained-form-parser-v2.0-2022-11-10, adds support for all 200+ languages supported by the Document OCR processor, which should include Thai.

    Refer to Managing processor versions for info on how to use this.
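
As a rough illustration of Update 2, the sketch below sends a file to one specific processor version and checks whether the returned text contains Thai characters. It assumes the google-cloud-documentai Python package is installed and credentials are configured. The helper names (processor_version_name, contains_thai, process_with_version) are hypothetical, and the project ID, location, processor ID and file path are placeholders you would replace with your own values.

```python
import re

# Thai Unicode block (U+0E00 to U+0E7F).
THAI_CHARS = re.compile(r"[\u0E00-\u0E7F]")


def processor_version_name(project_id, location, processor_id, version_id):
    """Build the resource name that targets one specific processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


def contains_thai(text):
    """Return True if the text has at least one character from the Thai block."""
    return bool(THAI_CHARS.search(text))


def process_with_version(project_id, location, processor_id, version_id,
                         file_path, mime_type="application/pdf"):
    """Send a local file to the given processor version and return the Document."""
    # Imported lazily so the two helpers above work without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    # Document AI uses a regional endpoint that matches the processor location.
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type=mime_type)

    # Using the processor *version* name (not the plain processor name)
    # selects the version explicitly instead of the processor's default.
    request = documentai.ProcessRequest(
        name=processor_version_name(project_id, location, processor_id, version_id),
        raw_document=raw_document,
    )
    return client.process_document(request=request).document
```

A call might look like `doc = process_with_version("my-project", "us", "my-processor-id", "pretrained-form-parser-v2.0-2022-11-10", "thai_form.pdf")`, followed by `contains_thai(doc.text)` to confirm that Thai script survived in the extracted text. Alternatively, setting the new version as the processor's default, as described in Managing processor versions, should let existing code keep using the plain processor name.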