String tokenization in Prolog


I have the following context-free grammar in a text file 'grammar.txt':

S ::= a S b
S ::= []

I can open this file and read each line in Prolog. Now I want to tokenize each line and generate a list such as

L=[['S','::=','a','S','b'],['S','::=','#']]  ('#' represents empty)

How can I do this?


Solution

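  • Write the specification as a DCG. Here is the basic idea (untested); you'll need to refine it. For example, skip_space as written also consumes newlines, and parse_symbol accepts only a single character of type alpha, so the [] body is not handled yet.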

    parse_grammar([Rule|Rules]) -->
      parse_rule(Rule),
      parse_grammar(Rules).
    parse_grammar([]) --> [].
    
    parse_rule([NT, '::=' | Body]) -->
      parse_symbol(NT),
      skip_space,
      "::=",
      skip_space,
      parse_symbols(Body),
      skip_space, !.  % the cut is required if you use findall/3 (see below)
    
    parse_symbols([S|Rest]) -->
      parse_symbol(S),
      skip_space,
      parse_symbols(Rest).
    parse_symbols([]) --> [].
    
    parse_symbol(S) -->
      [C], {code_type(C, alpha), atom_codes(S, [C])}.
    
    skip_space -->
      [C], {code_type(C, space)}, skip_space.
    skip_space --> [].
    

    This parses the whole file, using this toplevel:

      ...,
      read_file_to_codes('grammar.txt', Codes, []),  % SWI-Prolog: the third argument is an options list
      phrase(parse_grammar(Grammar), Codes, []).
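
    One possible refinement of the grammar above, offered as a sketch rather than a tested drop-in. It assumes SWI-Prolog, one rule per line, and single-character symbols. The names grammar//1, rule//1, body//1, hspace//0, end_of_line//0, tokenize_codes/2 and tokenize_file/2 are made up for this illustration and are not part of the original answer. The two changes are that whitespace skipping stops at line ends, so rules cannot run together, and that the literal [] is mapped to '#'.

    ```prolog
    :- use_module(library(readutil)).          % read_file_to_codes/3 (SWI-Prolog)
    :- set_prolog_flag(double_quotes, codes).  % "::=" denotes a code list in this file

    % grammar(-Rules): one rule per line; blank lines are skipped.
    grammar(Rules) --> blank_lines, rules(Rules), hspace.

    rules([R|Rs]) --> rule(R), !, blank_lines, rules(Rs).
    rules([])     --> [].

    % "S ::= a S b" is tokenized as ['S', '::=', a, 'S', b].
    rule([NT, '::=' | Body]) -->
        hspace, symbol(NT), hspace, "::=", hspace,
        body(Body), hspace, end_of_line.

    % The literal [] becomes '#', the marker the question uses for "empty".
    body(['#'])  --> "[]", !.
    body([S|Ss]) --> symbol(S), hspace, symbols(Ss).

    symbols([S|Ss]) --> symbol(S), !, hspace, symbols(Ss).
    symbols([])     --> [].

    % Assumption: every symbol is a single character of type alpha.
    symbol(S) --> [C], { code_type(C, alpha), atom_codes(S, [C]) }.

    % Horizontal whitespace only (space = 32, tab = 9), so a rule
    % never continues across a line end.
    hspace --> [C], { memberchk(C, [32, 9]) }, !, hspace.
    hspace --> [].

    newline --> "\r\n", !.
    newline --> "\n".

    end_of_line --> newline, !.
    end_of_line --> eos.

    blank_lines --> hspace, newline, !, blank_lines.
    blank_lines --> [].

    eos([], []).                               % succeeds only at end of input

    tokenize_codes(Codes, Rules) :-
        phrase(grammar(Rules), Codes).

    tokenize_file(File, Rules) :-
        read_file_to_codes(File, Codes, []),
        tokenize_codes(Codes, Rules).
    ```

    With the two-line grammar from the question, a query such as `atom_codes('S ::= a S b\nS ::= []', Codes), tokenize_codes(Codes, L)` should bind L to [['S','::=',a,'S',b],['S','::=','#']]; tokenize_file('grammar.txt', L) does the same starting from the file.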
    

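    You say you read the file one line at a time; in that case use the following, where get_line/1 stands for your own line reader (it has to yield the next line on backtracking for findall/3 to collect every rule):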

      ...
      findall(R, (get_line(L), phrase(parse_rule(R), L, [])), Grammar).
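
    If no backtrackable get_line/1 is at hand, a plain recursive loop over the stream is one alternative. The sketch below assumes SWI-Prolog (read_line_to_codes/2 from library(readutil)); stream_rules/3 and file_rules/3 are hypothetical helper names, and the rule parser is passed in as a DCG nonterminal, so parse_rule from the answer or any refinement of it can be plugged in. once/1 plays the role of the cut mentioned above by committing to the first parse of each line.

    ```prolog
    :- use_module(library(readutil)).      % read_line_to_codes/2 (SWI-Prolog)

    % stream_rules(+Stream, +RuleNT, -Rules)
    % Reads Stream line by line and parses every non-empty line with the
    % DCG nonterminal RuleNT//1. Fails if some line cannot be parsed.
    stream_rules(Stream, RuleNT, Rules) :-
        read_line_to_codes(Stream, Line),  % the line end is not part of Line
        (   Line == end_of_file
        ->  Rules = []
        ;   Line == []                     % skip empty lines
        ->  stream_rules(Stream, RuleNT, Rules)
        ;   once(phrase(call(RuleNT, Rule), Line)),
            Rules = [Rule|Rest],
            stream_rules(Stream, RuleNT, Rest)
        ).

    % file_rules(+File, +RuleNT, -Rules): same, but opens and closes File.
    file_rules(File, RuleNT, Rules) :-
        setup_call_cleanup(
            open(File, read, Stream),
            stream_rules(Stream, RuleNT, Rules),
            close(Stream)).
    ```

    A call would then look like `file_rules('grammar.txt', parse_rule, Grammar)`. It only succeeds once the rule nonterminal accepts every line of the file, including the [] body.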
    

    HTH