I'm trying to figure out which API I should use to get Google to intelligently split a string into words.
thequickbrownfoxjumpsoverthelazydog
When I go to Google Translate and input the string (with auto-detect language) and click on the "Listen" icon for Google to read out the string, it breaks up the words and reads it out correctly. So, I know they're able to do it.
What I can't figure out is whether it's the Google Translate API or the Text-to-Speech API that's breaking up the words, or whether there's any way to get those split words back in an API response.
Does anyone have experience using Google's APIs to do this?
AFAIK, there isn't an API in Google Cloud that does this specifically, although when you translate text using the Translation API, it does appear to split the concatenated words in the background.
Since you can't use the same language as both source and target, what you could do is translate to any other language and then translate back to the original language. This seems like overkill, though.
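The round trip described above could be sketched roughly as follows. This is only an illustration: `round_trip`, `google_translate`, and the choice of Spanish as the pivot language are assumptions of mine, and nothing guarantees that the text that comes back uses the same words as the input. The Google call assumes the `google-cloud-translate` package is installed and credentials are configured.

```python
def round_trip(text, translate, source="en", pivot="es"):
    """Translate source -> pivot -> source and return the final text.

    `translate` is any callable taking (text, source, target) and
    returning a string, so the idea can be tried without a Google account.
    """
    pivoted = translate(text, source, pivot)
    return translate(pivoted, pivot, source)


def google_translate(text, source, target):
    """Hypothetical adapter around the Cloud Translation (v2) client."""
    # Imported lazily so the rest of the sketch runs without the package.
    from google.cloud import translate_v2

    client = translate_v2.Client()
    result = client.translate(
        text,
        source_language=source,
        target_language=target,
        format_="text",  # ask for plain text rather than HTML-escaped output
    )
    return result["translatedText"]
```

You would call it as `round_trip('thequickbrownfoxjumpsoverthelazydog', google_translate)` and then split the result on whitespace. Note that this costs two API calls per string.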
You could also file a feature request asking for this to be implemented in, for example, the NLP API.
Depending on your use case, though, you could instead use the method suggested in this other Stack Overflow answer, which uses dynamic programming to infer the location of spaces in a string without spaces.
Another user even made a pip package named wordninja based on that (see the second answer on the same post). To install it:
pip3 install wordninja
Example usage:
$ python
>>> import wordninja
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
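If you are curious how the dynamic-programming approach works, here is a minimal sketch of the idea. It is not wordninja's actual code: the tiny `WORDS` list and the `log(rank + 2)` cost are stand-ins I made up for illustration, whereas a real implementation would use a large frequency-ranked dictionary.

```python
import math

# Hypothetical mini-vocabulary, ordered from most to least frequent.
WORDS = ["the", "over", "quick", "brown", "fox", "jumps", "lazy", "dog"]

# Assumed cost model: the rarer the word (higher rank), the higher the cost.
COST = {word: math.log(rank + 2) for rank, word in enumerate(WORDS)}
MAX_LEN = max(len(word) for word in WORDS)


def split_words(text):
    """Split `text` into known words with the lowest total cost."""
    # best[i] = (cheapest cost of segmenting text[:i], length of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, MAX_LEN) + 1):
            word = text[i - k:i]
            # Unknown substrings get an infinite cost, so they are never chosen.
            candidates.append((best[i - k][0] + COST.get(word, math.inf), k))
        best.append(min(candidates))

    if math.isinf(best[-1][0]):
        raise ValueError("text cannot be split using the known words")

    # Walk backwards through the stored word lengths to recover the words.
    words = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        words.append(text[i - k:i])
        i -= k
    return words[::-1]


print(split_words("thequickbrownfoxjumpsoverthelazydog"))
```

Each position in the string only looks back at most `MAX_LEN` characters, so the running time grows linearly with the length of the input for a fixed vocabulary.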