Split/tokenize/scan a string being aware of quotation marks


Is there a default or easy way in Java to split a string while respecting quotation marks or other delimiting symbols?

For example, given this text:

There's "a man" that live next door 'in my neighborhood', "and he gets me down..."

I would like to obtain:

There's
a man
that
live
next
door
in my neighborhood
and he gets me down

Solution

  • Something like this works for your input:

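        import java.util.Scanner;
        import java.util.regex.Pattern;

        class Main {
            public static void main(String[] args) {
                String text = "There's \"a man\" that live next door "
                    + "'in my neighborhood', \"and he gets me down...\"";

                Scanner sc = new Scanner(text);
                Pattern pattern = Pattern.compile(
                    "\"[^\"]*\"" +     // double-quoted token
                    "|'[^']*'" +       // single-quoted token
                    "|[A-Za-z']+"      // unquoted word (letters and apostrophes)
                );
                String token;
                // findInLine returns null once no further match exists on the line
                while ((token = sc.findInLine(pattern)) != null) {
                    System.out.println("[" + token + "]");
                }
            }
        }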
    

    The above prints (as seen on ideone.com):

    [There's]
    ["a man"]
    [that]
    [live]
    [next]
    [door]
    ['in my neighborhood']
    ["and he gets me down..."]
    

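    It uses Scanner.findInLine with a regex that is an alternation of three sub-patterns, tried in this order. Input that matches none of them (spaces, commas) is skipped:

    "[^"]*"      # double-quoted token
    '[^']*'      # single-quoted token
    [A-Za-z']+   # unquoted word: letters and apostrophes only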
    

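    This will not handle every input: nested or escaped quotes, for example, need more than this pattern. Note also that quoted tokens are returned with their quotation marks still attached, so strip them if you need exactly the output shown in the question.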
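The output above keeps the quotation marks, whereas the question asks for them to be removed. One possible way to get closer, sketched here as an assumption rather than as part of the original answer, is to use java.util.regex.Matcher with a capturing group around the part of each alternative that should be kept. The class name QuoteAwareTokenizer and the method tokenize are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class QuoteAwareTokenizer {

    // Same three alternatives as in the answer, each with a capturing
    // group around the text that should end up in the token.
    private static final Pattern TOKEN = Pattern.compile(
        "\"([^\"]*)\""       // group 1: contents of a double-quoted token
        + "|'([^']*)'"       // group 2: contents of a single-quoted token
        + "|([A-Za-z']+)"    // group 3: unquoted word (letters and apostrophes)
    );

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            if (m.group(1) != null) {
                tokens.add(m.group(1));
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));
            } else {
                tokens.add(m.group(3));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "There's \"a man\" that live next door "
            + "'in my neighborhood', \"and he gets me down...\"";
        for (String token : tokenize(text)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With the sample text this yields the quoted tokens without their quotation marks. The trailing dots inside the last quoted phrase are still kept, because they are part of the quoted content; removing them would need an extra step. The same limitation around nested or escaped quotes applies.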
