Tags: java, kotlin, jackson, yaml

How to parse a large YAML file in Java or Kotlin?


I have a large YAML file (~5MB) and I need to parse it using Kotlin/JVM.

I tried using the streaming API of Jackson 2.14.1, but it throws:

com.fasterxml.jackson.dataformat.yaml.JacksonYAMLParseException: The incoming YAML document exceeds the limit: 3145728 code points.
 at [Source: (ZipInputStream); line: 122415, column: 9]
...
Caused by: org.yaml.snakeyaml.error.YAMLException: The incoming YAML document exceeds the limit: 3145728 code points.

My YAML file is a large dictionary with roughly 5k keys, and a small document is associated with each key. I stream the root keys and parse each associated document with the JsonParser.readValueAs() method. Since I am streaming, I expected the size of the dictionary not to matter as long as each sub-document is small enough, but it does. I checked the document that fails to parse, at line 122415, and it is neither large (1.5 KB) nor ill-formed (according to https://www.yamllint.com/).

My code is:

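import com.fasterxml.jackson.core.JsonParseException
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.ObjectMapper
import java.io.InputStream
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import org.springframework.beans.factory.annotation.Qualifier
import org.springframework.stereotype.Service

@Service
class Parser(
    @Qualifier("yamlMapper") private val yamlMapper: ObjectMapper,
) {
    // Streams the root mapping and emits one Item per top-level key.
    fun parse(input: InputStream): Flow<Item> = flow {
        val parser = yamlMapper.factory.createParser(input)
        parser.use {
            // The document root must be a mapping.
            parser.requireToken(JsonToken.START_OBJECT)
            var token = parser.nextToken()
            while (token != JsonToken.END_OBJECT) {
                if (token != JsonToken.FIELD_NAME) {
                    throw JsonParseException(parser, "Expected FIELD_NAME but was $token")
                }
                // Each key maps to a small sub-document, bound to Item in one go.
                parser.requireToken(JsonToken.START_OBJECT)
                emit(parser.readValueAs(Item::class.java))
                token = parser.nextToken()
            }
            // Nothing may follow the root mapping.
            parser.requireToken(null)
        }
    }
}

// Advances the parser by one token and fails unless it is the expected one (null means end of input).
fun JsonParser.requireToken(expected: JsonToken?) {
    val actual = nextToken()
    if (actual != expected) {
        throw JsonParseException(this, "Expected ${expected ?: "end of file"} but was $actual")
    }
}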

Solution

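  • After digging through Jackson's documentation, this turned out to be quite easy. The limit is enforced by SnakeYAML on the YAML document as a whole, so streaming the tokens through Jackson does not avoid it. I needed to configure the YAMLFactory when creating the ObjectMapper: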

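    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
    import org.springframework.boot.autoconfigure.SpringBootApplication
    import org.springframework.context.annotation.Bean
    import org.yaml.snakeyaml.LoaderOptions

    @SpringBootApplication
    class Main {
        @Bean
        fun yamlMapper(): ObjectMapper =
            ObjectMapper(YAMLFactory.builder()
                .loaderOptions(LoaderOptions().apply {
                    // The limit counts code points, not bytes (about 100 MB of ASCII text).
                    codePointLimit = 100 * 1024 * 1024
                })
                .build()
            )
    }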
    

    See the "Maximum input YAML document size (3 MB)" section of the Jackson YAML module documentation.
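
For anyone who wants to try the setting outside Spring, here is a minimal sketch that reproduces the limit with a synthetic document and then raises it. The helper names (yamlMapperWithLimit, syntheticYaml, parses) and the generated input are my own assumptions for illustration, not part of the original code. It assumes Jackson 2.14 or later with jackson-dataformat-yaml on the classpath.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.dataformat.yaml.JacksonYAMLParseException
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import org.yaml.snakeyaml.LoaderOptions

// Hypothetical helper: a YAML ObjectMapper whose SnakeYAML loader accepts
// documents of up to maxCodePoints code points.
fun yamlMapperWithLimit(maxCodePoints: Int): ObjectMapper {
    val options = LoaderOptions().apply { codePointLimit = maxCodePoints }
    return ObjectMapper(YAMLFactory.builder().loaderOptions(options).build())
}

// Hypothetical test input: a single top-level mapping with many small entries,
// similar in shape to the dictionary described in the question.
fun syntheticYaml(entries: Int): String = buildString {
    repeat(entries) { i ->
        append("key").append(i).append(": {name: item").append(i).append("}\n")
    }
}

// Returns true if the mapper reads the whole document, false if YAML parsing
// fails (which includes, but is not limited to, hitting the size limit).
fun parses(mapper: ObjectMapper, yaml: String): Boolean =
    try {
        mapper.readTree(yaml)
        true
    } catch (e: JacksonYAMLParseException) {
        false
    }

fun main() {
    // 200,000 entries give well over 3 * 1024 * 1024 code points.
    val yaml = syntheticYaml(200_000)
    println("code points:   ${yaml.codePointCount(0, yaml.length)}")
    println("default limit: ${parses(ObjectMapper(YAMLFactory()), yaml)}")
    println("raised limit:  ${parses(yamlMapperWithLimit(100 * 1024 * 1024), yaml)}")
}
```

With Jackson 2.14.x I would expect the default mapper to fail and the configured one to succeed, but the default limit depends on the Jackson and SnakeYAML versions in use. Note that the limit is counted in code points, so a file with many multi-byte characters can be larger in bytes than the limit suggests.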