parsing, f#, fparsec

Using FParsec to parse possibly malformed input


I'm writing a parser for a specific file format using FParsec as a first foray into learning F#. Part of the file has the following format:

{ 123 456 789 333 }

The numbers inside the braces are pairs of values, separated by an arbitrary number of spaces. So these would also be valid input:

{  22 456              7 333     }

And of course the content of the brackets might be empty, i.e. {}

In addition, I want the parser to handle content that is somewhat malformed, e.g. { some descriptive text } or, more likely, { 12 3 4} (invalid since the 4 isn't paired with anything). In that case I just want the raw contents saved so they can be processed separately.

I have this so far:

type DimNummer = int
type ObjektNummer = int
type DimObjektPair = DimNummer * ObjektNummer
type ObjektListResult = Result<DimObjektPair list, string>


let sieObjektLista = 
    let pnum = numberLiteral NumberLiteralOptions.None "dimOrObj"
    let ws = spaces
    let pobj = pnum .>> ws |>> fun x -> 
        let on: ObjektNummer = int x.String
        on
    let pdim = pnum |>> fun x -> 
        let dim: DimNummer = int x.String
        dim

    let pdimObj = (pdim .>> spaces1) .>>. pobj |>> DimObjektPair

    let toObjektLista(objList:list<DimObjektPair>) = 
        let res: ObjektListResult = Result.Ok objList
        res

    let pdimObjs = sepBy pdimObj spaces1
    let validList = pdimObjs |>> toObjektLista

    let toInvalid(str:string) = 
        let res: ObjektListResult = 
            match str.Trim(' ')  with 
            | "" -> Result.Ok []
            | _ -> Result.Error str
        res

    let invalidList = manyChars anyChar |>> toInvalid
    let pres = between (pchar '{') (pchar '}') (ws >>. (validList <|> invalidList) .>> ws)
    pres

let parseSieObjektLista = run sieObjektLista

However, running this on a valid sample gives an error:

{ 53735        7785  86231   36732         }
                     ^
Expecting: whitespace or '}'

Solution

  • You're trying to consume too many spaces.

    Look: pdimObj is a pdim, followed by some spaces, followed by pobj, which is itself a pnum followed by some spaces. So if you look at the first part of the input:

    { 53735        7785  86231   36732         }
      \___/\______/\__/\/
        ^      ^    ^   ^
        |      |    |   |
       pnum    |    |   |
        ^   spaces1 |   |
        |           |   ws
       pdim        pnum  ^
         ^          ^    |
         |          \    /
         |           \  /
         |            \/
          \          pobj
           \          /
            \________/
                ^
                |
              pdimObj
    

    As the diagram shows, pdimObj consumes everything up to 86231, including the spaces just before it. When sepBy inside pdimObjs then looks for the next separator (spaces1, i.e. at least one whitespace character), it finds none, so it stops after the first pair. The parser for the closing brace runs next, sees 86231 instead, and reports the error.
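
    The effect can be reproduced in isolation. The following is a minimal sketch, assuming an F# script run with dotnet fsi; the names greedyPair and result are made up for illustration, and the pair parser is a condensed version of the one in the question:

    ```fsharp
    #r "nuget: FParsec"
    open FParsec

    // Digits only, as in the question; whitespace is not touched here.
    let pnum : Parser<int, unit> =
        numberLiteral NumberLiteralOptions.None "dimOrObj" |>> fun x -> int x.String

    // As in the question: the second number of a pair also eats trailing spaces.
    let greedyPair = (pnum .>> spaces1) .>>. (pnum .>> spaces)

    // sepBy wants at least one space between pairs, but greedyPair already took them.
    let result = run (sepBy greedyPair spaces1) "1 2  3 4"

    match result with
    | Success (pairs, _, _) -> printfn "%A" pairs   // pairs is [(1, 2)]; "3 4" is left unconsumed
    | Failure (msg, _, _) -> printfn "%s" msg
    ```

    Note that sepBy itself succeeds here with a single pair; in the full parser it is the closing '}' that then fails.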

    The smallest way to fix this is to make pdimObjs use many instead of sepBy: since pobj already consumes trailing spaces, there is no need to also consume them in sepBy:

    let pdimObjs = many pdimObj
    

    But a cleaner way, in my opinion, would be to remove ws from pobj, because, intuitively, trailing spaces aren't part of the number representing your object (whatever that is), and instead handle possible trailing spaces in pdimObjs via sepEndBy:

    let pobj = pnum |>> fun x ->
        let on: ObjektNummer = int x.String
        on
    ...
    let pdimObjs = sepEndBy pdimObj spaces1
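
Putting the sepEndBy variant together, a complete script might look like the sketch below. It assumes dotnet fsi with the FParsec NuGet package. The names praw, pcontent, plist and parse are hypothetical. The fallback for malformed content (attempt plus followedBy, and noneOf "}" in place of anyChar so the closing brace is not swallowed) is an assumption of this sketch, not something the answer above prescribes.

```fsharp
#r "nuget: FParsec"
open FParsec

type DimObjektPair = int * int
type ObjektListResult = Result<DimObjektPair list, string>

// Parses digits only and leaves surrounding whitespace alone.
let pnum : Parser<int, unit> =
    numberLiteral NumberLiteralOptions.None "dimOrObj" |>> fun x -> int x.String

let pdimObj : Parser<DimObjektPair, unit> = (pnum .>> spaces1) .>>. pnum

// sepEndBy accepts an optional trailing separator, so spaces before '}' are fine.
let pdimObjs : Parser<DimObjektPair list, unit> = sepEndBy pdimObj spaces1

// Assumption (not from the answer): fall back to the raw text when the pairs
// do not run all the way to the closing brace.
let praw : Parser<ObjektListResult, unit> =
    manyChars (noneOf "}") |>> fun s -> Result.Error s

let pcontent : Parser<ObjektListResult, unit> =
    attempt (pdimObjs .>> followedBy (pchar '}') |>> fun pairs -> Result.Ok pairs)
    <|> praw

let plist : Parser<ObjektListResult, unit> =
    between (pchar '{' >>. spaces) (pchar '}') pcontent

// Result.Ok / Result.Error are qualified because FParsec defines its own Ok and Error.
let parse (input: string) : ObjektListResult =
    match run plist input with
    | Success (res, _, _) -> res
    | Failure (msg, _, _) -> Result.Error msg

printfn "%A" (parse "{ 53735        7785  86231   36732         }")
printfn "%A" (parse "{}")
printfn "%A" (parse "{ 12 3 4}")
```

With this sketch the valid sample yields both pairs, {} yields an empty list, and { 12 3 4} comes back as an error carrying the raw text 12 3 4.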