Search code examples
algorithmcomputer-sciencenlpstring-algorithm

Find the words in a long stream of characters. Auto-tokenize


How would you find the correct words in a long stream of characters?

Input :

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"

Google's Output:

"The revised report on syntactic theories sequential controlandstate"

(which is close enough considering the time that they produced the output)

How do you think Google does it? How would you increase the accuracy?


Solution

  • I would try a recursive algorithm like this:

    • Try inserting a space at each position. If the left part is a word, then recur on the right part.
    • Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.

    For example, giving it "thesentenceisgood" would run:

    thesentenceisgood
    the sentenceisgood
        sent enceisgood
             enceisgood: OUT1: the sent enceisgood, 2/3
        sentence isgood
                 is good
                    go od: OUT2: the sentence is go od, 4/5
                 is good: OUT3: the sentence is good, 4/4
        sentenceisgood: OUT4: the sentenceisgood, 1/2
    these ntenceisgood
          ntenceisgood: OUT5: these ntenceisgood, 1/2
    

    So you would pick OUT3 as the answer.