
Parser in Ruby: dealing with sticky comments and quotes


I am trying to write a recursive-descent parser in Ruby for a grammar defined by the following rules:

  1. Input consists of white-space-separated Cards, each starting with a Stop-word; white-space matches the regex /[ \n\t]+/
  2. A Card may consist of Keywords and/or Values, also separated by white-space, in a card-specific order/pattern
  3. All Stop-words and Keywords are case-insensitive, i.e. they match /^[a-z]+[a-z0-9]*$/i
  4. A Value can be a double-quoted string, which need not be separated from other words by white-space, e.g.:

    word"quoted string"word
    
  5. A Value can also be a word /^[a-z]+[a-z0-9]*$/, an integer, or a float (e.g. -1.15 or 1.0e+2)

  6. A single-line comment starts with # and need not be separated from other words, e.g.:

    word#single-line comment\n
    
  7. A multi-line comment is delimited by /* and */ and need not be separated from other words, e.g.:

    word/*multi-line 
    comment*/word
    

# Input example. Stop-words are chosen just to highlight them: set, object
set title"Input example"set objects 2#not-separated by white-space. test: "/*
set test "#/*"
object 1 shape box/* shape is a Keyword, 
box is a Value. test: "#*/object 2 shape sphere
set data # message and complete are Values
0 0 0 0 1 18 18 18 1 35 35 35 72 35 35 # all numbers are Values of the Card "set"

Since most words are separated by white-space, I first considered splitting the whole input and parsing it word by word. To deal with comments and quotes, I was going to do

words = input_text.gsub( /([\"\#\n]|\/\*|\*\/)/, ' \1 ' ).split( /[ \t]+/ )

However, this modifies the content of strings (and of comments, if I want to keep them). How would you deal with these sticky comments and quotes?
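
To make the problem concrete, here is a minimal sketch that applies the same gsub/split idea to one line of the input example. The variable name input_text comes from the snippet above; the rest is only an illustration.

```ruby
# One line from the input example: the Value is the quoted string "#/*".
input_text = 'set test "#/*"'

# Pad quotes, comment markers and newlines with spaces, then split on blanks.
words = input_text.gsub(/(["\#\n]|\/\*|\*\/)/, ' \1 ').split(/[ \t]+/)

p words
# >> ["set", "test", "\"", "#", "/*", "\""]
```

The quoted string comes back as four separate words, and any run of spaces inside a string would be collapsed by the split, so the original content cannot be recovered.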


Solution

  • OK, I solved it myself. The code below can be shortened if readability is not a concern.

    class WordParser
      attr_reader :words
    
      def initialize text
        @text = text
      end
    
      def parse
        reset_parser
        until eof?
          case curr_char
            when '"' then
              start_word and add_chars_until? '"'
              close_word
            when '#','%' then # '%' is not among the grammar rules in the question; drop it if unused
              start_word and add_chars_until? "\n"
              close_word
            when '/' then
              if next_is? '*' then
                start_word and 2.times { add_char }
                add_char until curr_is? '*' and next_is? '/' or eof?
                2.times { add_char } unless eof?
                close_word
              else
                # parser_error "unexpected symbol '/'" # if not allowed in the grammar
                start_word unless word_already_started?
                add_char
              end
            when /[^\s]/ then
              start_word unless word_already_started?
              add_char
            else # skip whitespace etc. between words
              move       # move returns nil at end of input, so do not chain with `and`
              close_word
          end
        end
        return @words
      end
    
    private
    
      def reset_parser
        @position = 0
        @line, @column = 1, 1
        @words = []
        @word_started = false
      end
    
      def parser_error s
        # Fill the current line and column into the message before printing it
        Kernel.puts format('Parser error on line %d, col %d: %s', @line, @column, s)
        raise 'Parser error'
      end
    
      def word_already_started?
        @word_started
      end
    
      def close_word
        @word_started = false
      end
    
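      # Adds characters up to and including the next occurrence of ch,
      # or up to the end of input if ch never appears.
      # Despite the trailing "?", this is not a predicate.
      def add_chars_until? ch
        add_char until next_is? ch or eof?
        2.times { add_char } unless eof?
      end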
    
      def add_char
        @words.last[:to] = @position
        # @words.last[:length] += 1
        # @words.last << curr_char # if one just collects words
        move
      end
    
      def start_word
        @words.push from: @position, to: @position, line: @line, column: @column
        # @words.push '' unless @words.last&.empty? # if one just collects words
        @word_started = true
      end
    
      def move
        increase :@position
        return if eof?
        if prev_is? "\n"
          increase :@line
          reset :@column
        else
          increase :@column
        end
      end
    
      def reset var; instance_variable_set(var, 1) end
      def increase var; instance_variable_set(var, instance_variable_get(var)+1) end
    
      def eof?; @position >= @text.length end
    
      def prev_is? ch; prev_char == ch end
      def curr_is? ch; curr_char == ch end
      def next_is? ch; next_char == ch end
    
      def prev_char; @text[ @position-1 ] end
      def curr_char; @text[ @position   ] end
      def next_char; @text[ @position+1 ] end
    end
    

    Test using the example from my question, with the input stored in text:

    words = WordParser.new(text).parse
    p words.collect { |w| text[ w[:from]..w[:to] ] }
    
    # >> ["# Input example. Stop-words are chosen just to highlight them: set, object\n", 
    # >>  "set", "title", "\"Input example\"", "set", "objects", "2", 
    # >>  "#not-separated by white-space. test: \"/*\n", "set", "test", "\"#/*\"", 
    # >>  "object", "1", "shape", "box", "/* shape is a Keyword, \nbox is a Value. test: \"#*/", 
    # >>  "object", "2", "shape", "sphere", "set", "data", "# message and complete are Values\n", 
    # >>  "0", "0", "0", "0", "1", "18", "18", "18", "1", "35", "35", "35", "72", 
    # >>  "35", "35", "# all numbers are Values of the Card \"set\"\n"]
    

    So now I can use something like this to parse the words further.
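
As one illustration of such a next step, here is a minimal sketch (not part of the solution above) that labels each extracted word using the patterns from the grammar rules. The method name classify, the constant names, the returned symbols and the STOP_WORDS list (taken from the example input) are all assumptions made for this sketch. Telling Keywords apart from word Values depends on the Card, so both are reported as :word here.

```ruby
# Hypothetical stop-word list, taken from the example input in the question.
STOP_WORDS = %w[set object].freeze

# Patterns following rules 3 and 5 of the grammar.
INTEGER_RE    = /\A[+-]?\d+\z/
FLOAT_RE      = /\A[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:e[+-]?\d+)?\z/i
IDENTIFIER_RE = /\A[a-z][a-z0-9]*\z/i

# Returns a symbol describing the kind of word.
# Integers are tested before floats, so "18" is :integer and "1.0e+2" is :float.
def classify(word)
  if word.start_with?('#', '/*')      then :comment
  elsif word.start_with?('"')         then :string
  elsif word.match?(INTEGER_RE)       then :integer
  elsif word.match?(FLOAT_RE)         then :float
  elsif word.match?(IDENTIFIER_RE)
    STOP_WORDS.include?(word.downcase) ? :stop_word : :word
  else
    :unknown
  end
end

classify('set')                # => :stop_word
classify('title')              # => :word
classify('"Input example"')    # => :string
classify('-1.15')              # => :float
```

A card-level parser could then skip :comment words, start a new Card at each :stop_word, and check the remaining words against that Card's expected pattern.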