Search code examples
goragel

How to implement regex /cat[s]?(\b|$)/ with ragel correclty?


I want to speed up my program written in Go and convert regular expressions to finite state machines with ragel. I cannot figure out how to match end of input correctly when converting regular expressions similar to /cat[s]?(\b|$)/ (it matches a word border or end of input), so I made this workaround:

package main

import(
  "strings"
  "fmt"
  "unicode"
)

func Match(data []byte) bool {
  data = []byte(strings.ToLower(string(data)))

  %%{
    machine test;
    write data;
  }%%

  cs, p, pe, eof := 0, 0, len(data), len(data)
  _ = eof

  var matchPos int

  %%{
    main := ('cat' 's'?) @{matchPos = p};

    write init;
    write exec;
  }%%

  return checkMatch(data, matchPos+1)
}

func checkMatch(data []byte, p int) bool {
  if p == len(data) {
    return true
  }
  tail := string(data[p:])
  c := []rune(tail)[0]
  if !unicode.IsLetter(c) && !unicode.IsDigit(c) {
    return true
  }
  return false
}

func main () {
  vs := []string{
    "cat",
    "cats",
    "cat and dog",
    "cats and dogs",
    "caterpillar",
  }
  for _, v := range vs {
    fmt.Printf("'%s': %v\n", v, Match([]byte(v)))
  }
}

Finite State Machine Graph

the output is correct:

'cat': true
'cats': true
'cat and dog': true
'cats and dogs': true
'caterpillar': false

I do think there is a better way though. What is a "proper" way to handle end of input with ragel?


Solution

  • The proper way to handle end of input is to use EOF actions, of course. And use actions in general, like this (reduced Match function):

      var matched bool
    
      %%{
        action setMatched {
          matched = true
        }
    
        main := ('cat' 's'?) %/setMatched ([ \t] >setMatched);
    
        write init;
        write exec;
      }%%
      // Variable generated and not used by Ragel.
      _ = _test_trans_actions
    
      return matched
    

    Which produces the following output (notice one important test case added):

    'cat': true
    'cats': true
    'cat and dog': true
    'cats and dogs': true
    'catspaw': false
    'caterpillar': false
    

    And works like this: enter image description here

    What it adds is setMatched action that is triggered either by the EOF in one of the final states (%/setMatched) of the first machine (cats?), or upon entering (>setMatched) the second one (which is almost \b, but can actually be replaced with the internal space machine). And it eliminates the need for checkMatch completely.