Tokenizer.from_file() (Hugging Face): Exception: data did not match any variant of untagged enum ModelWrapper

I am having an issue loading a BPE tokenizer with Tokenizer.from_file(). When I try, I get the following error, where line 11743 is at the very end of the file:

    Exception: data did not match any variant of untagged enum ModelWrapper at line 11743 column 3

I have no idea what the problem is or how to solve it. Does anyone have a clue?

I did not train this BPE directly, but the structure is the correct one: vocab and merges in a JSON file. I started from a BPE tokenizer that I had trained myself (and that was working) and completely replaced the vocab and the merges with ones I created manually, without proper training. I don't see the problem, since the structure should be the same as the original one.

My tokenizers version is 0.13.1. The end of the file looks like this:



      "QD FLPDSITF",
      "QPHY AS",
      "LR SE",
      "A DRV"
    ] #11742
  } #11743
} #11744


  • When I encountered this problem, the root cause was a missing pre_tokenizer. In my case, adding a Whitespace pre-tokenizer solved the issue.

    Here is an example:

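    from tokenizers import Tokenizer, models, pre_tokenizers

    # Attach a pre-tokenizer before training or saving the tokenizer.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()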
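
Since the vocab and merges in the question were edited by hand, another thing that may be worth checking is whether every merge is consistent with the vocab. As far as I know, the BPE loader expects both halves of each merge, and their concatenation, to be present in the vocab, and a violation can end up reported as this same generic ModelWrapper error. The sketch below is only a rough sanity check under those assumptions, using the standard json module. The function names and the "tokenizer.json" path are hypothetical, and it assumes no continuing_subword_prefix is set.

```python
import json


def find_bad_merges(model):
    """Return (index, merge, reason) for merges that are inconsistent with the vocab."""
    vocab = model["vocab"]
    bad = []
    for index, merge in enumerate(model["merges"]):
        # tokenizers 0.13.x stores each merge as one "left right" string;
        # other versions may store a [left, right] pair, so accept both.
        parts = merge.split(" ") if isinstance(merge, str) else list(merge)
        if len(parts) != 2:
            bad.append((index, merge, "not exactly two parts"))
            continue
        left, right = parts
        # Assumption: both halves and the merged token must exist in the vocab.
        for token in (left, right, left + right):
            if token not in vocab:
                bad.append((index, merge, f"{token!r} not in vocab"))
                break
    return bad


def check_tokenizer_file(path):
    """Load a tokenizer JSON file (hypothetical path) and check its BPE model."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return find_bad_merges(data["model"])


# Tiny in-memory example: the second merge produces "abc", which is not in the vocab.
example_model = {
    "vocab": {"a": 0, "b": 1, "ab": 2, "c": 3},
    "merges": ["a b", "ab c"],
}
problems = find_bad_merges(example_model)
for index, merge, reason in problems:
    print(f"merge #{index} {merge!r}: {reason}")
```

To run it on a real file, call check_tokenizer_file("tokenizer.json") with your own path. An empty result does not guarantee the file will load, but any reported merge is a good place to start looking.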