Search code examples
parsingf#fparsec

Improving the readability of a FParsec parser


I have a hand-written CSS parser done in C# which is getting unmanageable and was trying to do it i FParsec to make it more mantainable. Here's a snippet that parses a css selector element made with regexes:

var tagRegex = @"(?<Tag>(?:[a-zA-Z][_\-0-9a-zA-Z]*|\*))";
var idRegex = @"(?:#(?<Id>[a-zA-Z][_\-0-9a-zA-Z]*))";
var classesRegex = @"(?<Classes>(?:\.[a-zA-Z][_\-0-9a-zA-Z]*)+)";
var pseudoClassRegex = @"(?::(?<PseudoClass>link|visited|hover|active|before|after|first-line|first-letter))";
var selectorRegex = new Regex("(?:(?:" + tagRegex + "?" + idRegex + ")|" +
                                 "(?:" + tagRegex + "?" + classesRegex + ")|" +
                                  tagRegex + ")" +
                               pseudoClassRegex + "?");

var m = selectorRegex.Match(str);

if (m.Length != str.Length) {
    cssParserTraceSwitch.WriteLine("Unrecognized selector: " + str);
    return null;
}

string tagName = m.Groups["Tag"].Value;

string pseudoClassString = m.Groups["PseudoClass"].Value;
CssPseudoClass pseudoClass;
if (pseudoClassString.IsEmpty()) {
    pseudoClass = CssPseudoClass.None;
} else {
    switch (pseudoClassString.ToLower()) {
        case "link":
            pseudoClass = CssPseudoClass.Link;
            break;
        case "visited":
            pseudoClass = CssPseudoClass.Visited;
            break;
        case "hover":
            pseudoClass = CssPseudoClass.Hover;
            break;
        case "active":
            pseudoClass = CssPseudoClass.Active;
            break;
        case "before":
            pseudoClass = CssPseudoClass.Before;
            break;
        case "after":
            pseudoClass = CssPseudoClass.After;
            break;
        case "first-line":
            pseudoClass = CssPseudoClass.FirstLine;
            break;
        case "first-letter":
            pseudoClass = CssPseudoClass.FirstLetter;
            break;
        default:
            cssParserTraceSwitch.WriteLine("Unrecognized selector: " + str);
            return null;
    }
}

string cssClassesString = m.Groups["Classes"].Value;
string[] cssClasses = cssClassesString.IsEmpty() ? EmptyArray<string>.Instance : cssClassesString.Substring(1).Split('.');
allCssClasses.AddRange(cssClasses);

return new CssSelectorElement(
    tagName.ToLower(),
    cssClasses,
    m.Groups["Id"].Value,
    pseudoClass);

My first attempt yielded this:

type CssPseudoClass =
    | None = 0
    | Link = 1
    | Visited = 2
    | Hover = 3
    | Active = 4
    | Before = 5
    | After = 6
    | FirstLine = 7
    | FirstLetter = 8

type CssSelectorElement = 
    { Tag : string
      Id : string
      Classes : string list    
      PseudoClass : CssPseudoClass } 
with
    static member Default = 
        { Tag = "";
          Id = "";
          Classes = [];
          PseudoClass = CssPseudoClass.None; }

open FParsec

let ws = spaces
let str = skipString
let strWithResult str result = skipString str >>. preturn result

let identifier =
    let isIdentifierFirstChar c = isLetter c || c = '-'
    let isIdentifierChar c = isLetter c || isDigit c || c = '_' || c = '-'    
    optional (str "-") >>. many1Satisfy2L isIdentifierFirstChar isIdentifierChar "identifier"

let stringFromOptional strOption =
    match strOption with
    | Some(str) -> str
    | None -> ""

let pseudoClassFromOptional pseudoClassOption =
    match pseudoClassOption with
    | Some(pseudoClassOption) -> pseudoClassOption
    | None -> CssPseudoClass.None

let parseCssSelectorElement =
    let tag = identifier <?> "tagName"
    let id = str "#" >>. identifier <?> "#id"
    let classes = many1 (str "." >>. identifier) <?> ".className"
    let parseCssPseudoClass =
        choiceL [ strWithResult "link" CssPseudoClass.Link;
                  strWithResult "visited" CssPseudoClass.Visited;
                  strWithResult "hover" CssPseudoClass.Hover;
                  strWithResult "active" CssPseudoClass.Active;
                  strWithResult "before" CssPseudoClass.Before;
                  strWithResult "after" CssPseudoClass.After;
                  strWithResult "first-line" CssPseudoClass.FirstLine;
                  strWithResult "first-letter" CssPseudoClass.FirstLetter]
                 "pseudo-class"    
    // (tag?id|tag?classes|tag)pseudoClass?
    pipe2 ((pipe2 (opt tag) 
                  id
                  (fun tag id -> 
                      { CssSelectorElement.Default with 
                          Tag = stringFromOptional tag;
                          Id = id })) |> attempt
           <|>
           (pipe2 (opt tag) 
                  classes
                  (fun tag classes -> 
                      { CssSelectorElement.Default with 
                          Tag = stringFromOptional tag;
                          Classes = classes })) |> attempt
           <|>
           (tag |>> (fun tag -> { CssSelectorElement.Default with Tag = tag })))
           (opt (str ":" >>. parseCssPseudoClass) |> attempt)
           (fun selectorElem pseudoClass -> { selectorElem with PseudoClass = pseudoClassFromOptional pseudoClass })

But I'm not really liking how it's shaping up. I was expecting to come up with something easier to understand, but the part parsing (tag?id|tag?classes|tag)pseudoClass? with a few pipe2's and attempt's is really bad.

Came someone with more experience in FParsec educate me on better ways to accomplish this? I'm thinking on trying FSLex/Yacc or Boost.Spirit instead of FParsec is see if I can come up with nicer code with them


Solution

  • As Mauricio said, if you find yourself repeating code in an FParsec parser, you can always factor out the common parts into a variable or custom combinator. This is one of the great advantages of combinator libraries.

    However, in this case you could also simplify and optimize the parser by reorganizing the grammer a bit. You could, for example, replace the lower half of the parseCssSelectorElement parser with

    let defSel = CssSelectorElement.Default
    
    let pIdSelector = id |>> (fun str -> {defSel with Id = str})
    let pClassesSelector = classes |>> (fun strs -> {defSel with Classes = strs})
    
    let pSelectorMain = 
         choice [pIdSelector 
                 pClassesSelector 
                 pipe2 tag (pIdSelector <|> pClassesSelector <|>% defSel)
                       (fun tagStr sel -> {sel with Tag = tagStr})]
    
    pipe2 pSelectorMain (opt (str ":" >>. parseCssPseudoClass))
          (fun sel optPseudo ->
               match optPseudo with
               | None -> sel
               | Some pseudo -> {sel with PseudoClass = pseudo})
    

    By the way, if you want to parse a large number of string constants, it's more efficient to use a dictionary based parsers, like

    let pCssPseudoClass : Parser<CssPseudoClass,unit> =
        let pseudoDict = dict ["link", CssPseudoClass.Link
                               "visited", CssPseudoClass.Visited
                               "hover", CssPseudoClass.Hover
                               "active", CssPseudoClass.Active
                               "before", CssPseudoClass.Before
                               "after", CssPseudoClass.After
                               "first-line", CssPseudoClass.FirstLine
                               "first-letter", CssPseudoClass.FirstLetter]        
        fun stream ->
            let reply = identifier stream            
            if reply.Status <> Ok then Reply(reply.Status, reply.Error)
            else 
                let mutable pseudo = CssPseudoClass.None
                if pseudoDict.TryGetValue(reply.Result, &pseudo) then Reply(pseudo)
                else // skip to beginning of invalid pseudo class                   
                    stream.Skip(-reply.Result.Length)
                    Reply(Error, messageError "unknown pseudo class")