Superpower: Match any non-whitespace character except tokenized values


I would like to use the NuGet package Superpower to match all non-whitespace characters, except where part of the text is a tokenized value. E.g.,

var s = "some random text{variable}";

Should result in:

["some", "random", "text", "variable"]

But what I have now is:

["some", "random", "text{variable}"]

The parsers for it look like:

    public static class TextParser
    {
        public static TextParser<string> EncodedContent =>
            from open in Character.EqualTo('{')
            from chars in Character.Except('}').Many()
            from close in Character.EqualTo('}')
            select new string(chars);

        public static TextParser<string> HtmlContent =>
            from content in Span.NonWhiteSpace
            select content.ToString();
    }

Of course, I'm returning the strings in another variable in the real parser; this is just a simplified version.

Hopefully that is enough information. If not, I do have the whole repo up on GitHub: https://github.com/jon49/FlowSharpHtml


Solution

  • There are many ways to parse your input, and depending on how much more complex your real inputs are (you say you've simplified them), you will probably need to tweak this. The best way to approach Superpower is to create small parsers and then build on them. See my parsers and their descriptions below (each one builds on the previous):

    // Requires: using System.Linq; using Superpower; using Superpower.Parsers;

    /// <summary>
    /// Parses any character other than whitespace or brackets.
    /// </summary>
    public static TextParser<char> NonWhiteSpaceOrBracket =>
        from c in Character.Except(c => 
            char.IsWhiteSpace(c) || c == '{' || c == '}',
            "Anything other than whitespace or brackets"
        )
        select c;
    
    /// <summary>
    /// Parses any piece of valid text, i.e. any text other than whitespace or brackets.
    /// </summary>
    public static TextParser<string> TextContent =>
        from content in NonWhiteSpaceOrBracket.Many()
        select new string(content);
    
    /// <summary>
    /// Parses an encoded piece of text enclosed in brackets.
    /// </summary>
    public static TextParser<string> EncodedContent =>
        from open in Character.EqualTo('{')
        from text in TextContent
        from close in Character.EqualTo('}')
        select text;
    
    /// <summary>
    /// Parses a single piece of content, e.g. "name{variable}" or just "name".
    /// </summary>
    public static TextParser<string[]> Content =>
        from text in TextContent
        from encoded in EncodedContent.OptionalOrDefault()
        select encoded != null ? new[] { text, encoded } : new[] { text };
    
    /// <summary>
    /// Parses multiple contents and flattens the result.
    /// </summary>
    public static TextParser<string[]> AllContent =>
        from content in Content.ManyDelimitedBy(Span.WhiteSpace)
        select content.SelectMany(x => x).ToArray();
    

    Then to run it:

    string input = "some random text{variable}";
    var result = AllContent.Parse(input);
    

    Which outputs:

    ["some", "random", "text", "variable"]
    

    The idea is to build a parser for a single piece of content, then use Superpower's built-in ManyDelimitedBy combinator to simulate a "split" on the whitespace between the pieces you actually want to parse out. This results in an array of "content" pieces.
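
    To see the "split" behaviour of ManyDelimitedBy in isolation, here is a minimal sketch. The names SplitDemo, Word and Words are made up for illustration, and the trailing AtEnd() is an addition that is not in the parsers above; it makes the parse fail if any input is left over.

    ```csharp
    using System;
    using Superpower;
    using Superpower.Parsers;

    public static class SplitDemo
    {
        // Hypothetical helper: one or more letters, returned as a string.
        public static readonly TextParser<string> Word =
            Character.Letter.AtLeastOnce().Select(chars => new string(chars));

        // Words separated by runs of whitespace. AtEnd() (an addition for this
        // sketch) requires the whole input to be consumed.
        public static readonly TextParser<string[]> Words =
            Word.ManyDelimitedBy(Span.WhiteSpace).AtEnd();

        public static void Main()
        {
            string[] words = Words.Parse("alpha beta  gamma");
            Console.WriteLine(string.Join("|", words)); // alpha|beta|gamma
        }
    }
    ```

    Because Span.WhiteSpace consumes a whole run of whitespace, the double space between "beta" and "gamma" counts as a single delimiter.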

    You may also want to take advantage of Superpower's token functionality to produce better error messages when parsing fails. It's a slightly different approach; take a look at this blog post to read more about how to use the tokenizer. It's completely optional if you don't need friendlier error messages.
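
    As a rough idea of what the token-based approach could look like for this input, here is a sketch built on Superpower's TokenizerBuilder. It is not taken from the blog post, and the names ContentToken and ContentTokenParser are hypothetical. It also assumes whitespace can simply be ignored, which means "text{variable}" and "text {variable}" produce the same tokens.

    ```csharp
    using System;
    using Superpower;
    using Superpower.Parsers;
    using Superpower.Tokenizers;

    // Hypothetical token kinds for this sketch.
    public enum ContentToken { Text, LBrace, RBrace }

    public static class ContentTokenParser
    {
        // Splits the input into brace and text tokens, skipping whitespace.
        static readonly Tokenizer<ContentToken> Tokenizer =
            new TokenizerBuilder<ContentToken>()
                .Ignore(Span.WhiteSpace)
                .Match(Character.EqualTo('{'), ContentToken.LBrace)
                .Match(Character.EqualTo('}'), ContentToken.RBrace)
                .Match(
                    Character.Except(
                        c => char.IsWhiteSpace(c) || c == '{' || c == '}',
                        "anything other than whitespace or brackets").AtLeastOnce(),
                    ContentToken.Text)
                .Build();

        // A plain piece of text.
        static readonly TokenListParser<ContentToken, string> Text =
            Token.EqualTo(ContentToken.Text).Select(t => t.ToStringValue());

        // A piece of text enclosed in brackets.
        static readonly TokenListParser<ContentToken, string> Encoded =
            from open in Token.EqualTo(ContentToken.LBrace)
            from text in Text
            from close in Token.EqualTo(ContentToken.RBrace)
            select text;

        // Any sequence of plain or encoded pieces, up to the end of the input.
        static readonly TokenListParser<ContentToken, string[]> AllContent =
            Text.Or(Encoded).Many().AtEnd();

        public static string[] Parse(string input) =>
            AllContent.Parse(Tokenizer.Tokenize(input));

        public static void Main()
        {
            string[] result = Parse("some random text{variable}");
            Console.WriteLine(string.Join(", ", result));
        }
    }
    ```

    Both Tokenize and Parse throw a ParseException on bad input, so an unbalanced bracket is reported in terms of the tokens the parser expected rather than individual characters.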