Tags: string, algorithm, tokenize, text-segmentation

How to split concatenated strings of this kind: "howdoIsplitthis?"


Suppose I have a string such as this:

"IgotthistextfromapdfIscraped.HowdoIsplitthis?"

And I want to produce:

"I got this text from a pdf I scraped. How do I split this?"

How can I do it?


Solution

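  • This task is called word segmentation, and the Python library wordsegment can do it. Note that it returns lowercase words and drops punctuation, so the original casing and sentence marks have to be restored separately: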

    >>> from wordsegment import load, segment
    >>> load()
    >>> segment("IgotthistextfromapdfIscraped.HowdoIsplitthis?")
    ['i', 'got', 'this', 'text', 'from', 'a', 'pdf', 'i', 'scraped', 'how',
     'do', 'i', 'split', 'this']
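
To get the exact sentence from the question (casing and punctuation intact), one option is to segment each run of letters separately and reattach whatever punctuation followed it. Below is a minimal pure-Python sketch of that idea. It does not use wordsegment: it swaps in a simple dynamic-programming split that picks the fewest dictionary words. `VOCAB`, `segment_chunk`, and `resegment` are hypothetical names made up for this illustration, and the tiny vocabulary only covers this one example. A real word list or frequency model would be needed for arbitrary text.

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch). Real use
# needs a large word list or corpus frequencies.
VOCAB = {"i", "got", "this", "text", "from", "a", "pdf", "scraped",
         "how", "do", "split"}


def segment_chunk(chunk, vocab=VOCAB):
    """Split one letters-only chunk into the fewest vocabulary words.

    Matching is case-insensitive, but the returned words keep the
    original casing of the input.
    """
    lower = chunk.lower()
    n = len(lower)
    # best[i] is the shortest word list covering chunk[:i], or None
    # if chunk[:i] cannot be covered by vocabulary words.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is not None and lower[start:end] in vocab:
                candidate = best[start] + [chunk[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit chunk when no full segmentation exists.
    return best[n] if best[n] is not None else [chunk]


def resegment(text, vocab=VOCAB):
    """Segment each run of letters and reattach trailing punctuation."""
    parts = []
    # Each match is (run of letters, following non-letters such as "." or "?").
    # Non-letters before the first letter are ignored in this sketch.
    for letters, punct in re.findall(r"([A-Za-z]+)([^A-Za-z]*)", text):
        parts.append(" ".join(segment_chunk(letters, vocab)) + punct)
    return " ".join(parts)


print(resegment("IgotthistextfromapdfIscraped.HowdoIsplitthis?"))
# I got this text from a pdf I scraped. How do I split this?
```

Choosing the fewest words is a crude tie-breaker. A frequency-based scorer, which is the kind of approach wordsegment takes, handles ambiguous splits much better.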