
Participle is reporting an Unexpected Token


I am playing with participle to learn how to write a parser, and I cannot work out why it reports an unexpected token here.

// nolint: golint, dupl
package main

import (
    "fmt"
    "io"

    "github.com/alecthomas/participle/v2"
    "github.com/alecthomas/participle/v2/lexer"
)

var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
    {"Comment", `^#[^\n]*`},
    {"Ident", `^\w+`},
    {"Int", `\d+`},
    {"String", `("(\\"|[^"])*"|\S+)`},
    {"EOL", `[\n\r]+`},
    {"whitespace", `[ \t]+`},
})

type HTACCESS struct {
    Directives []*Directive `@@*`
}

type Directive struct {
    Pos lexer.Position

    ErrorDocument *ErrorDocument `@@`
}

type ErrorDocument struct {
    Code int    `"ErrorDocument" @Int`
    Path string `@String`
}

var htaccessParser = participle.MustBuild[HTACCESS](
    participle.Lexer(htaccessLexer),
    participle.CaseInsensitive("Ident"),
    participle.Unquote("String"),
    participle.Elide("whitespace"),
)

func Parse(r io.Reader) (*HTACCESS, error) {
    program, err := htaccessParser.Parse("", r)
    if err != nil {
        return nil, err
    }

    return program, nil
}

func main() {
    v, err := htaccessParser.ParseString("", `ErrorDocument 403 test`)

    if err != nil {
        panic(err)
    }

    fmt.Println(v)
}

From what I can tell this seems correct: I expect 403 to be lexed as an Int and captured into Code, but I am not sure why the parser doesn't recognize it.

Edit: I changed my lexer to this:

var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
    {"dir", `^\w+`},
    {"int", `\d+`},
    {"str", `("(\\"|[^"])*"|\S+)`},
    {"EOL", `[\n\r]+`},
    {"whitespace", `\s+`},
})

With that change the error is gone, but it still prints an empty array of directives, and I am not sure why. I am also unsure why using different names for the lexer rules makes the error go away in the first place.


Solution

  • I believe I found the issue: it is the rule order. Ident was matching numbers in my lexer via \w, so my integers were being tokenized as Ident instead of Int (a token-dump sketch illustrating this follows after this answer).

    I also found that I have to separate quoted strings and unquoted strings, otherwise the unquoted-string rule was picking up integers. Alternatively I could restrict it to non-numeric values, but that would miss things like stringwithnum2.

    Here is my solution:

    var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
        {"Comment", `(?i)#[^\n]*`},
        // Quoted strings first, then numbers, then unquoted strings,
        // so integers are never consumed by a broader rule.
        {"QuotedString", `"(\\"|[^"])*"`},
        {"Number", `[-+]?(\d*\.)?\d+`},
        // Exclude newlines here so the EOL rule below can still match.
        {"UnQuotedString", `[^ \t\r\n]+`},
        {"Ident", `[a-zA-Z_]\w*`},
        {"EOL", `[\n\r]+`},
        {"whitespace", `[ \t]+`},
    })
    
    type ErrorDocument struct {
        Pos lexer.Position
    
        Code int    `"ErrorDocument" @Number`
        Path string `(@QuotedString | @UnQuotedString)`
    }
    

    This fixed my issue because the lexer now matches quoted strings first, then numbers, then unquoted strings, so the integer is tokenized as Number and the path as a string. A sketch of wiring this lexer back into the parser follows below.
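
To see the original problem directly, it can help to dump the token stream produced by the question's lexer. The following is a minimal sketch, not part of the answer above, assuming participle v2's lexer.Definition/Lexer API (Symbols, Lex, Next and Token.EOF):

package main

import (
    "fmt"
    "strings"

    "github.com/alecthomas/participle/v2/lexer"
)

func main() {
    // Same rule order as in the question: Ident comes before Int.
    def := lexer.MustSimple([]lexer.SimpleRule{
        {"Comment", `^#[^\n]*`},
        {"Ident", `^\w+`},
        {"Int", `\d+`},
        {"String", `("(\\"|[^"])*"|\S+)`},
        {"EOL", `[\n\r]+`},
        {"whitespace", `[ \t]+`},
    })

    // Invert Symbols() so token types can be printed by name.
    names := map[lexer.TokenType]string{}
    for name, typ := range def.Symbols() {
        names[typ] = name
    }

    lx, err := def.Lex("", strings.NewReader(`ErrorDocument 403 test`))
    if err != nil {
        panic(err)
    }
    for {
        tok, err := lx.Next()
        if err != nil {
            panic(err)
        }
        if tok.EOF() {
            break
        }
        // With this rule order, "403" comes out as an Ident token (matched by \w+),
        // which is why the grammar's @Int never sees it.
        fmt.Printf("%-12s %q\n", names[tok.Type], tok.Value)
    }
}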
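
For completeness, here is a sketch of wiring the reordered lexer back into the parser. The answer does not show the updated participle.MustBuild options, so the choices below (dropping the unused Ident rule and switching Unquote("String") to Unquote("QuotedString")) are assumptions rather than part of the original solution:

package main

import (
    "fmt"

    "github.com/alecthomas/participle/v2"
    "github.com/alecthomas/participle/v2/lexer"
)

// Quoted strings first, then numbers, then unquoted strings, so that
// integers are no longer consumed by a broader rule.
var htaccessLexer = lexer.MustSimple([]lexer.SimpleRule{
    {"Comment", `(?i)#[^\n]*`},
    {"QuotedString", `"(\\"|[^"])*"`},
    {"Number", `[-+]?(\d*\.)?\d+`},
    {"UnQuotedString", `[^ \t\r\n]+`},
    {"EOL", `[\n\r]+`},
    {"whitespace", `[ \t]+`},
})

type HTACCESS struct {
    Directives []*Directive `@@*`
}

type Directive struct {
    Pos lexer.Position

    ErrorDocument *ErrorDocument `@@`
}

type ErrorDocument struct {
    Pos lexer.Position

    Code int    `"ErrorDocument" @Number`
    Path string `(@QuotedString | @UnQuotedString)`
}

var htaccessParser = participle.MustBuild[HTACCESS](
    participle.Lexer(htaccessLexer),
    participle.Unquote("QuotedString"), // was Unquote("String") in the question
    participle.Elide("whitespace"),
)

func main() {
    v, err := htaccessParser.ParseString("", `ErrorDocument 403 test`)
    if err != nil {
        panic(err)
    }
    fmt.Printf("code=%d path=%q\n",
        v.Directives[0].ErrorDocument.Code,
        v.Directives[0].ErrorDocument.Path)
}

With the corrected order, 403 is tokenized as Number and captured into Code, and test is tokenized as UnQuotedString and captured into Path.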