I was wondering what would be a good approach to tokenizing a string such as:
"'The president' of the United States is Barack Obama"
The desired output is:
{The president, of, the, United States, is, Barack Obama}
After some searching, I arrived at the following regex:
([A-Z][a-zA-Z0-9-]*)([\s][A-Z][a-zA-Z0-9-]*)+|'([^']*?)'|[^\s{.,:;”’()?!}]+
This seems to work for my purposes.
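As a rough illustration of how the regex above could be applied, here is a minimal sketch. Python is an assumption (the regex itself is not tied to a language), and the `tokenize` helper is a hypothetical name. The quoted alternative matches the surrounding quotes too, so the sketch takes capture group 3 for those matches to drop them.

```python
import re

# The pattern from the post, unchanged, split over three lines for readability.
TOKEN_RE = re.compile(
    r"([A-Z][a-zA-Z0-9-]*)([\s][A-Z][a-zA-Z0-9-]*)+"  # two or more capitalized words in a row
    r"|'([^']*?)'"                                     # single-quoted phrase, content in group 3
    r"|[^\s{.,:;”’()?!}]+"                             # any other run of non-separator characters
)

def tokenize(text):
    """Hypothetical helper: return the tokens matched by TOKEN_RE, in order."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        quoted = m.group(3)
        # Group 3 is set only when the quoted alternative matched;
        # use it so the surrounding quotes are not part of the token.
        tokens.append(quoted if quoted is not None else m.group(0))
    return tokens

print(tokenize("'The president' of the United States is Barack Obama"))
# ['The president', 'of', 'the', 'United States', 'is', 'Barack Obama']
```

One thing to be aware of: the first alternative groups any run of capitalized words, so a capitalized sentence-initial word followed by a proper noun (for example "The United States") would come back as a single token.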
Sources: https://stackoverflow.com/a/4113082/6601606 https://stackoverflow.com/a/16746437/6601606