
How do I skip documents with errors in a YAML stream?


It does not appear that the Python library PyYAML will allow me to read a multi-document YAML stream and continue past the point of a parsing error. I have two related questions:

  1. Am I just missing something, and some other API will support this?
  2. Do parsers in other programming languages support this operation? (if so, which)

Here is an example of a multiple-document YAML stream:

%YAML 1.1
---
# YAML can contain comments like this
name: David
age: 55
---
name: Mei
age: 50     # Including end-of-line
---
name: Juana: ERROR
age: 47
...
---
name: Adebayo
age: 58
...

I would like code similar to this to skip the bad document and recognize that, no matter how bad the document is, something new starts after the ... and ---.

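import yaml

with open('data/multidoc-bad.yaml') as stream:
    # safe_load_all returns a generator that yields one document at a time
    docs = yaml.safe_load_all(stream)
    while True:
        try:
            doc = next(docs)
            print(doc)
        except StopIteration:
            break
        except Exception as err:
            # Report the error and try to move on to the next document
            print(err)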

I'd like to get:

{'name': 'David', 'age': 55}
{'name': 'Mei', 'age': 50}
mapping values are not allowed here
  in "data/multidoc-bad.yaml", line 10, column 12
{'name': 'Adebayo', 'age': 58}

But in reality I do not get that last line for "Adebayo."

I recognize that I could write a small parser myself that reads lines and looks only for ... and --- lines to chunk the stream, and then pass single documents to yaml.load() after my own parsing. But it sure seems like that's what a parser is supposed to do for me.


Solution

    Am I just missing something, and some other API will support this?

    No, PyYAML cannot do this.

    Do parsers in other programming languages support this operation? (if so, which)

    None that I know of. Most YAML parsers are hand-written, and quite a few are translations of PyYAML. I don't know of a single one that implements error recovery. (I have worked with SnakeYAML, go-yaml, PyYAML, libyaml, and YamlDotNet, and authored NimYAML and AdaYaml.)

    But it sure seems like that's what a parser is supposed to do for me.

    I think the reasons why parsers don't support this include

    • writing a compliant parser for YAML is already very complex without error recovery,
    • the multi-document feature is seldom used and therefore little effort is put into enhancing it,
    • skipping to the next document is the only case where it is obvious how to implement error recovery; I would argue that inside a YAML document, it is nigh impossible to implement useful error recovery, and therefore error recovery is not seen as an obvious feature to implement,
    • the workaround is very simple (you described it yourself).
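
The workaround from the last bullet can be sketched roughly as follows. This is a minimal sketch, not a feature of PyYAML: the names split_documents and load_all_tolerant and the parse parameter are made up for illustration. It assumes that any line starting with --- or ... at column 0 is a document marker, and that directives such as %YAML 1.1 belong to the document that follows them. The parser is passed in as a callable so that the chunking itself needs only the standard library.

```python
import re

# Document markers sit at column 0 and are followed by whitespace or end of line.
_START = re.compile(r"^---(\s|$)")
_END = re.compile(r"^\.\.\.(\s|$)")


def split_documents(text):
    """Yield the raw text of each document in a multi-document YAML stream."""
    chunk = []        # lines collected for the current document
    has_body = False  # True once the chunk holds a "---" line or real content
    for line in text.splitlines(keepends=True):
        if _END.match(line):
            # "..." closes the current document; the marker itself is dropped.
            if has_body:
                yield "".join(chunk)
            chunk, has_body = [], False
        elif _START.match(line):
            # "---" opens a new document. Directives or comments collected so
            # far (e.g. "%YAML 1.1") stay attached to the document they precede.
            if has_body:
                yield "".join(chunk)
                chunk = []
            chunk.append(line)
            has_body = True
        else:
            chunk.append(line)
            is_comment = line.lstrip().startswith("#")
            is_directive = line.startswith("%")
            if line.strip() and not (is_comment or is_directive):
                has_body = True
    if has_body:
        yield "".join(chunk)


def load_all_tolerant(text, parse):
    """Parse each document separately, yielding (document, error) pairs."""
    for chunk in split_documents(text):
        try:
            yield parse(chunk), None
        except Exception as err:  # with PyYAML, yaml.YAMLError would suffice
            # Assumption: a failed document is reported and then skipped.
            yield None, err
```

With PyYAML installed, passing yaml.safe_load as parse, for example load_all_tolerant(stream.read(), yaml.safe_load), should report the error for the "Juana" document and still return the "Adebayo" document. Note that line numbers in error messages are then relative to the chunk rather than to the whole file, and that the whole stream is read into memory, which is a simplification.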