
Tokens not in order for tokenization / lexer (kotlin)


I am creating a tokenization system in Kotlin / JVM that takes in a file and returns each char or sequence of chars as a token. For some reason, whenever I tokenize a string, it finds the second instance of a "string" token before moving on to the next token; in other words, the tokens are not in order. I think it might have to do with the loop, but I just can't figure it out. I am still learning Kotlin, so if anyone could give me pointers as well, that'd be great! Any help is much appreciated.

Output of tokens:

[["chello", string], ["tomo", string], [:, keyV], ["hunna", string], ["moobes", string], ["hunna", string]]

My file looks like this.

STORE "chello" : "tomo" as 1235312

SEND "hunna" in Hollo

GET "moobes"

GET "hunna"

My code:

fun tokenCreator(file: BufferedReader) {
    var lexicon: String = file.readText()

    val numRegex = Regex("^[1-9]\\d*(\\.\\d+)?\$")
    val dataRegex = Regex("[(){}]")
    val token = mutableListOf<List<Any>>()

    for ((index, char) in lexicon.withIndex()) {

        println(char)
        when {
            char.isWhitespace() -> continue

            char.toString() == ":" -> token.add(listOf(char.toString(), "keyV"))

            char.toString().matches(Regex("[()]")) -> token.add(listOf(char, "group"))

            char.toString().matches(dataRegex) -> token.add(listOf(char, "data_group"))

            char == '>' -> token.add(listOf(char.toString(), "verbline"))

            char == '"' -> {

                var stringOf = ""
                val firstQuote = lexicon.indexOf(char)
                val secondQuote = lexicon.indexOf(char, firstQuote + 1)

                if (firstQuote == -1 || secondQuote == -1) {
                    break
                }
                for (i in firstQuote..secondQuote) {
                    stringOf += lexicon[i]
                }
                lexicon = lexicon.substring(secondQuote + 1, lexicon.length)
                token.add(listOf(stringOf, "string"))
            }
        }

    }

    println(token)

}


Solution

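  • Changing the content while iterating is a recipe for confusion: the for loop keeps walking the original string, with its original indices, while lexicon is reassigned to a shorter string inside the loop.

    You also never advance the index past the content you have consumed, so the closing quote of each string is visited again and treated as a new opening quote. I'd recommend changing the loop to one that lets you skip over consumed content, for example a while loop with a manually advanced index.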

    I'd also remove this line:

    lexicon = lexicon.substring(secondQuote + 1, lexicon.length)
    

    Then replace

      val firstQuote = lexicon.indexOf(char)
    

    with

      val firstQuote = index
    

    You can also use substring instead of iteration for stringOf

      val stringOf = lexicon.substring(
    

    Moreover, converting each character with toString just to compare it with ":" is inefficient; you can compare the Char directly, as in char == ':'.
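
Putting the suggestions above together, a minimal sketch might look like the following. It assumes the file has already been read into a String. The Token data class and the tokenize function name are hypothetical, introduced here for illustration in place of the List<Any> pairs in the question. Numbers and bare words are still skipped, as in the original code.

```kotlin
// Hypothetical token type; the question stores tokens as List<Any> pairs instead.
data class Token(val text: String, val type: String)

fun tokenize(source: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var index = 0
    // A while loop with a manual index lets us jump past consumed characters.
    while (index < source.length) {
        val char = source[index]
        when {
            char == ':' -> tokens.add(Token(":", "keyV"))
            char == '(' || char == ')' -> tokens.add(Token(char.toString(), "group"))
            char == '{' || char == '}' -> tokens.add(Token(char.toString(), "data_group"))
            char == '>' -> tokens.add(Token(">", "verbline"))
            char == '"' -> {
                // Look for the closing quote starting just past the opening one.
                val closing = source.indexOf('"', index + 1)
                // Unterminated string: stop, like the `break` in the question.
                if (closing == -1) return tokens
                // Keep the surrounding quotes, matching the question's output.
                tokens.add(Token(source.substring(index, closing + 1), "string"))
                // Jump to the closing quote; the increment below moves past it.
                index = closing
            }
            // Whitespace and all other characters produce no token.
        }
        index++
    }
    return tokens
}

fun main() {
    val input = """
        STORE "chello" : "tomo" as 1235312
        SEND "hunna" in Hollo
        GET "moobes"
        GET "hunna"
    """.trimIndent()
    println(tokenize(input).map { listOf(it.text, it.type) })
    // prints [["chello", string], [:, keyV], ["tomo", string], ["hunna", string], ["moobes", string], ["hunna", string]]
}
```

Because the source string is never modified and the index is moved to the closing quote after each string, every character is visited at most once and the tokens come out in source order.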