algorithm computer-science nlp string-algorithm

Find the words in a long stream of characters. Auto-tokenize

How would you find the correct words in a long stream of characters?

Input :

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

Google's Output:

"The revised report on syntactic theories sequential controlandstate"

(which is close enough considering the time that they produced the output)

How do you think Google does it? How would you increase the accuracy?

Solution

I would try a recursive algorithm like this:

Try inserting a space at each position. If the left part is a word, then recur on the right part.
Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.

For example, giving it "thesentenceisgood" would run:

thesentenceisgood
the sentenceisgood
    sent enceisgood
         enceisgood: OUT1: the sent enceisgood, 2/3
    sentence isgood
             is good
                go od: OUT2: the sentence is go od, 4/5
             is good: OUT3: the sentence is good, 4/4
    sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
      ntenceisgood: OUT5: these ntenceisgood, 1/2

So you would pick OUT3 as the answer.