
Java/Kotlin: Tokenize a string ignoring the contents of nested quotes


I would like to split a string on spaces but keep the spaces inside quotes (and the quotes themselves). The problem is that the quotes can be nested, and I need to handle both single and double quotes. So, from the line

    this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"

I would like to get

    [this, "'"is a possible option"'", and, ""so is this"", and, '''this one too''', and, even, ""mismatched quotes"]

This question has already been asked, but not in exactly this form. The existing answers offer several solutions: one uses a matcher (in which case """x""" would be split into [""", x"""], so it is not what I need), and another uses Apache Commons (which works with """x""" but not with ""x"", since it takes the first two double quotes and leaves the last two with x). There are also suggestions to write a function that does this manually, but that would be a last resort.


Solution

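  • You can achieve that with the following regex: ["']+[^"']+?["']+. It matches a run of one or more quotes, then the shortest possible run of non-quote characters, then another run of quotes. Using that pattern, you can retrieve the indices at which to split like this: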

    val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
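
For readers on the Java side of the question, the same index lookup can be sketched with java.util.regex. This is only a minimal sketch and not part of the original answer: the class name QuotedRuns and the method quotedRanges are made up for illustration. Matcher.end() is exclusive, so the sketch subtracts 1 to mirror Kotlin's endInclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class QuotedRuns {
    // Same pattern as in the answer: quote run, shortest non-quote body, quote run.
    static final Pattern QUOTED = Pattern.compile("[\"']+[^\"']+?[\"']+");

    /** Returns {start, inclusiveEnd} for every quoted run found in the input. */
    static List<int[]> quotedRanges(String input) {
        List<int[]> ranges = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            // end() points one past the match, so subtract 1 for an inclusive end
            ranges.add(new int[] { m.start(), m.end() - 1 });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The doubled quotes are matched as one run on each side of the text.
        for (int[] r : quotedRanges("say \"\"so is this\"\" twice")) {
            System.out.println(r[0] + ".." + r[1]); // prints 4..17
        }
    }
}
```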
    

    The rest is building the list out of substrings. Here is the complete function:

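    fun String.splitByPattern(pattern: String): List<String> {
    
        // Start and inclusive end index of every match, flattened into one list
        val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
    
        var lastIndex = 0
        val parts = indices.mapIndexed { i, ele ->
    
            // Even positions are match starts (cut before the quote);
            // odd positions are inclusive match ends (cut after it, hence + 1)
            val end = if(i % 2 == 0) ele else ele + 1
    
            substring(lastIndex, end).apply {
                lastIndex = end
            }
        }
    
        // Keep any text after the last match and drop empty pieces
        return (parts + substring(lastIndex)).filter { it.isNotEmpty() }
    }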
    

    Usage:

    val str = """
    this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
    """.trim()
    
    println(str.splitByPattern("""["']+[^"']+?["']+"""))
    

    Output:

    [this , "'"is a possible option"'", and , ""so is this"", and , '''this one too''', and even , ""mismatched quotes"]

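
Note that the pieces in the output above keep their surrounding spaces, and "and even" stays a single element, whereas the question asked for separate words. One possible way to get exactly the requested tokens is sketched below in Java, assuming that unquoted text should simply be split on whitespace: the quoted-run pattern is kept as the first alternative, and a second alternative (an addition, not from the answer) matches plain words. The class name QuoteAwareTokenizer and the method tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer, named here only for illustration.
class QuoteAwareTokenizer {
    // First alternative: the quoted-run pattern from the answer.
    // Second alternative (assumption): a plain word without whitespace or quotes.
    private static final Pattern TOKEN =
            Pattern.compile("[\"']+[^\"']+?[\"']+|[^\\s\"']+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Whitespace between tokens matches neither alternative, so it is skipped
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "this \"'\"is a possible option\"'\" and \"\"so is this\"\" and "
                + "'''this one too''' and even \"\"mismatched quotes\"";
        // prints [this, "'"is a possible option"'", and, ""so is this"", and,
        //         '''this one too''', and, even, ""mismatched quotes"] on one line
        System.out.println(tokenize(line));
    }
}
```

Because both alternatives live in one pattern, no index bookkeeping is needed. The sketch inherits the limits of the original pattern: unbalanced quotes, such as the apostrophe in it's, are not handled reliably.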