F#, FParsec, and Calling a Stream Parser Recursively


I'm developing a multi-part MIME parser using F# and FParsec. I'm working iteratively, so this code is highly unrefined and brittle; it only solves my first immediate problem. Red, Green, Refactor.

I'm required to parse a stream rather than a string, which is really throwing me for a loop. Given that constraint, to the best of my understanding, I need to call a parser recursively. How to do that is beyond my ken, at least with the way I've proceeded thus far.

namespace MultipartMIMEParser

open FParsec
open System.IO

type private Post = { contentType : string
                    ; boundary    : string
                    ; subtype     : string
                    ; content     : string }

type MParser (s:Stream) =
  let ($) f x = f x
  let ascii = System.Text.Encoding.ASCII
  let str cs = System.String.Concat (cs:char list)
  let q = "\""
  let qP = pstring q
  let pSemicolon = pstring ";"
  let manyNoDoubleQuote = many $ noneOf q
  let enquoted = between qP qP manyNoDoubleQuote |>> str
  let skip = skipStringCI
  let pContentType = skip "content-type: "
                     >>. manyTill anyChar (attempt $ preturn () .>> pSemicolon)
                     |>> str
  let pBoundary = skip " boundary=" >>. enquoted
  let pSubtype = opt $ pSemicolon >>. skip " type=" >>. enquoted
  let pContent = many anyChar |>> str // TODO: The content parser needs to recurse on the stream.
  let pStream = pipe4 pContentType pBoundary pSubtype pContent
                      $ fun c b t s -> { contentType=c; boundary=b; subtype=t; content=s }
  let result s = match runParserOnStream pStream () "" s ascii with
                 | Success (r,_,_) -> r
                 | Failure (e,_,_) -> failwith (sprintf "%A" e)
  let r = result s
  member p.ContentType = r.contentType
  member p.Boundary = r.boundary
  member p.ContentSubtype = r.subtype
  member p.Content = r.content

The first line of the example POST follows:

content-type: Multipart/related; boundary="RN-Http-Body-Boundary"; type="multipart/related"

It spans a single line in the file. Further sub-parts in the content include content-type values that span multiple lines, so I know I'll have to refine my parsers if I am to reuse them.

Somehow I have to call pContent with the (string?) result of pBoundary so that I can split the rest of the stream on the appropriate boundaries. Then I somehow have to return multiple parts for the content of the post, each of which will be a separate post with its own headers and content (so the content will obviously have to be something other than a string). My head is spinning. This code already seems far too complex for parsing a single line.

Much appreciation for insight and wisdom!


Solution

  • This is a fragment that might get you going in the right direction.

    Get your parsers to spit out something with the same base type. I prefer to use F#'s discriminated unions for this purpose. If you really do need to push values into a Post type, then walk the returned AST. That's just the way I'd approach it.

    #if INTERACTIVE
    #r"""..\..\FParsecCS.dll"""    // ... edit path as appropriate to bin/debug, etc.
    #r"""..\..\FParsec.dll"""
    #endif
    
    let packet = @"content-type: Multipart/related; boundary=""RN-Http-Body-Boundary""; type=""multipart/related""
    
    --RN-Http-Body-Boundary
    Message-ID: <25845033.1160080657073.JavaMail.webmethods@exshaw>
    Mime-Version: 1.0
    Content-Type: multipart/related; type=""application/xml"";
      boundary=""----=_Part_235_11184805.1160080657052""
    
    ------=_Part_235_11184805.1160080657052
    Content-Type: Application/XML
    Content-Transfer-Encoding: binary
    Content-Location: RN-Preamble
    Content-ID: <1430586.1160080657050.JavaMail.webmethods@exshaw>"
    
    //XML document begins here...
    
    type AST =
    | Document of AST list
    | Header of AST list
    /// i.e., Content-Type is the tag, and it consists of a list of key-value pairs
    | Tag of string * AST list  
    | KeyValue of string * string
    | Body of string
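
    As a rough sketch of how the boundary from the first header could drive the parser for the rest of the input, FParsec's `>>=` combinator lets one parser's result build the next parser. Everything named here (`quoted`, `pBoundaryHeader`, `pParts`, `pDocument`) is hypothetical and not from the question. It only handles the single-line header, assumes the script is run under `dotnet fsi` with NuGet access, and does not deal with folded headers, nested multiparts, or the closing `--boundary--` delimiter.

    ```fsharp
    #if INTERACTIVE
    #r "nuget: FParsec"   // assumption: dotnet fsi with NuGet access; edit as appropriate
    #endif

    open FParsec

    // Same shape as the AST above, repeated so the sketch stands alone.
    type AST =
        | Document of AST list
        | Header of AST list
        | Tag of string * AST list
        | KeyValue of string * string
        | Body of string

    // A double-quoted value such as "RN-Http-Body-Boundary".
    let quoted : Parser<string, unit> =
        between (pchar '"') (pchar '"') (manySatisfy (fun c -> c <> '"'))

    // Reads the first header line and returns only the boundary string.
    let pBoundaryHeader : Parser<string, unit> =
        skipStringCI "content-type: "
        >>. skipManySatisfy (fun c -> c <> ';')
        >>. skipString "; boundary="
        >>. quoted
        .>> skipRestOfLine true

    // Built only once the boundary is known: splits the remaining input on it.
    let pParts (boundary: string) : Parser<AST list, unit> =
        let delimiter : Parser<unit, unit> = skipString ("--" + boundary)
        // One part runs up to the next delimiter or the end of the input.
        let part =
            manyCharsTill anyChar (delimiter <|> eof)
            |>> fun (text: string) -> Body (text.Trim())
        // Skip the preamble and first delimiter, then collect parts until end of input.
        skipManyTill anyChar delimiter
        >>. many (notFollowedBy eof >>. part)

    // >>= hands the parsed boundary to pParts, which returns the parser to run next.
    let pDocument : Parser<AST, unit> =
        pBoundaryHeader >>= fun boundary ->
            pParts boundary
            |>> fun parts ->
                let header = Header [ Tag ("content-type", [ KeyValue ("boundary", boundary) ]) ]
                Document (header :: parts)
    ```

    For a string, `run pDocument text` exercises it. For a stream, the same parser can be passed to `runParserOnStream pDocument () "" stream System.Text.Encoding.ASCII`, as in the question. Each `Body` could in turn be parsed again for its own headers, which is where the recursion would come in.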
    

    The AST DU above could represent a first pass of the example data you posted in your other question. It could be finer grained than that, but simpler is normally better. I mean, the ultimate destination in your example is a Post type, and you could achieve that with some simple pattern matching.
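
    The pattern matching mentioned above might look something like the following minimal sketch. The `Post` record mirrors the one in the question. `tryValue`, `tryAttributes` and `toPost` are hypothetical helpers. It also assumes that the parser stores the media type under a made-up `"media-type"` key and emits tag names in lower case, neither of which the AST itself dictates.

    ```fsharp
    // Same shape as the AST above, repeated so the sketch stands alone.
    type AST =
        | Document of AST list
        | Header of AST list
        | Tag of string * AST list
        | KeyValue of string * string
        | Body of string

    // Mirrors the Post record from the question.
    type Post =
        { contentType : string
          boundary    : string
          subtype     : string
          content     : string }

    // Finds the value stored under a key among a tag's children.
    let tryValue (key: string) (children: AST list) : string option =
        let pick node =
            match node with
            | KeyValue (k, v) when k = key -> Some v
            | _ -> None
        List.tryPick pick children

    // Finds the children of the first tag with the given name.
    let tryAttributes (name: string) (tags: AST list) : AST list option =
        let pick node =
            match node with
            | Tag (n, children) when n = name -> Some children
            | _ -> None
        List.tryPick pick tags

    // Walks a Document whose first node is a Header and flattens it into a Post.
    let toPost (ast: AST) : Post option =
        match ast with
        | Document (Header tags :: parts) ->
            let attrs = tryAttributes "content-type" tags |> Option.defaultValue []
            let get key = tryValue key attrs |> Option.defaultValue ""
            let bodyText node =
                match node with
                | Body text -> Some text
                | _ -> None
            Some { contentType = get "media-type"   // assumed key, see above
                   boundary    = get "boundary"
                   subtype     = get "type"
                   content     = parts |> List.choose bodyText |> String.concat "\n" }
        | _ -> None
    ```

    Returning an option keeps the unexpected shapes (no header, not a document) explicit rather than throwing. Missing attributes fall back to empty strings here only to keep the sketch short.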