Tags: java, regex, indexing, tokenize

Java - Tokenizing a string based on quotes and capital case


I was wondering what would be a good approach to tokenize a string such as:

"'The president' of the United States is Barack Obama"

So that it returns:

{The president, of, the, United States, is, Barack Obama}

Solution

  • After some searching, I arrived at the following regex:

    ([A-Z][a-zA-Z0-9-]*)([\s][A-Z][a-zA-Z0-9-]*)+|'([^']*?)'|[^\s{.,:;”’()?!}]+
    

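    This seems to work for my purposes. The first alternative matches a run of two or more capitalized words, the second matches single-quoted text (captured in group 3 without the quotes), and the third matches any other run of characters that are neither whitespace nor the listed punctuation.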

    Sources: https://stackoverflow.com/a/4113082/6601606 https://stackoverflow.com/a/16746437/6601606
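
As a rough illustration of how the regex above might be applied, here is a minimal sketch using java.util.regex. The class name QuoteAwareTokenizer and the method tokenize are hypothetical names introduced for this example, and the two curly quotes in the last character class are written as Unicode escapes so the result does not depend on the source file encoding. It assumes that quoted tokens should be returned without their surrounding quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and structure are illustrative only.
final class QuoteAwareTokenizer {

    // The regex from the answer, split across lines for readability.
    // The curly quotes are written as Unicode escapes (U+201D, U+2019).
    private static final Pattern TOKEN = Pattern.compile(
        "([A-Z][a-zA-Z0-9-]*)([\\s][A-Z][a-zA-Z0-9-]*)+"  // two or more capitalized words
        + "|'([^']*?)'"                                     // single-quoted text (group 3)
        + "|[^\\s{.,:;\u201D\u2019()?!}]+");                // any other word

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 3 is non-null only when the quoted alternative matched;
            // using it drops the surrounding quotes.
            tokens.add(m.group(3) != null ? m.group(3) : m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [The president, of, the, United States, is, Barack Obama]
        System.out.println(tokenize("'The president' of the United States is Barack Obama"));
    }
}
```

Because the first alternative needs at least two capitalized words in a row, a lone capitalized word falls through to the third alternative and becomes an ordinary token. A capitalized sentence opener followed by a proper noun would be merged into one token, so this approach may need adjusting for such input.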