
Using a Perl look-ahead assertion to split a list into individual definitions


Given a list like this:

direct_SQL_statement ::=
  directly_executable_statement semicolon

directly_executable_statement ::=
    direct_SQL_data_statement
  | SQL_schema_statement
  | SQL_transaction_statement
  | SQL_connection_statement
  | SQL_session_statement
  | direct_implementation_defined_statement

direct_SQL_data_statement ::=
    delete_statement__searched
  | direct_select_statement__multiple_rows
  | insert_statement
  | update_statement__searched
  | truncate_table_statement
  | merge_statement
  | temporary_table_declaration

direct_implementation_defined_statement ::=
  "!! See the Syntax Rules."

apostrophe ::=
  "'"
/*
5.2     token and separator

Function

Specify lexical units (tokens and separators) that participate in SQL language.


Format
*/
token ::=
    nondelimiter_token
  | delimiter_token

identifier_part ::=
    identifier_start
  | identifier_extend
/*
identifier_start ::=
  "!! See the Syntax Rules."
identifier_extend ::=
  "!! See the Syntax Rules."
*/
large_object_length_token ::=
  digit+ multiplier

Is it possible to use Perl's look-ahead assertion to break it up into a list of individual definitions?

I tried,

perl -0777ne 'print "$&\n^^\n\n" while /(?=\w+\s*::=)\w+\s*::=\s*.+/gs;'

but it just returned the whole thing (as if the look-ahead assertion is not working at all), while

perl -0777ne 'print "$&\n^^\n\n" while /(?=\w+\s*::=)\w+\s*::=\s*.+?/gs;'

comes up just too short:

direct_SQL_statement ::=
  d
^^

directly_executable_statement ::=
    d
^^

direct_SQL_data_statement ::=
    d
^^

direct_implementation_defined_statement ::=
  "
^^

I need to break it up into individual BNF definition chunks for further processing, like this for the initial test data:

direct_SQL_statement ::=
  directly_executable_statement semicolon
^^


directly_executable_statement ::=
    direct_SQL_data_statement
  | SQL_schema_statement
  | SQL_transaction_statement
  | SQL_connection_statement
  | SQL_session_statement
  | direct_implementation_defined_statement
^^


direct_SQL_data_statement ::=
    delete_statement__searched
  | direct_select_statement__multiple_rows
  | insert_statement
  | update_statement__searched
  | truncate_table_statement
  | merge_statement
  | temporary_table_declaration
^^


direct_implementation_defined_statement ::=
  "!! See the Syntax Rules."
^^

Notes,

  • The above output is from the initial test data.
  • The whole A ::= B thing is called a BNF definition. The "^^" is only a visual indication that the separation is done properly.
  • The apostrophe and the following token are different BNF definitions and should be treated as such. The /* ... */ comment should be filtered out of the output.
  • Comments may come without empty lines surrounding them. That's the reason I need to rely on the look-ahead assertion instead of paragraph mode.
  • The question is a follow-up to How can EBNF or BNF be parsed?, whose solution is "W3C EBNF doesn't end a production with a semicolon because a ::= operator comes after the LHS symbol of a new production."
  • The whole file can be found at github.com/ronsavage/SQL/blob/master/sql-2016.ebnf

Solution

  • With possible comments (/* ... */) that need to be omitted:

    perl -0777 -wnE'say for m{(.*?::=.*?)\n (?: \n+ | (?:/\*.*?\*/) | \z)}gsx' bnf.txt
    

    This captures a line with ::= and all that follows it, up to whichever comes first: an empty line (more newlines), a /* ... */ comment, or the end of the string.

    The modifier /s makes . match newlines as well, which it normally doesn't, so that .*? can match multiline text. With /x, literal spaces in the pattern are ignored and can be used for readability.
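
    To make the role of /s and /x easier to see, the same pattern can be laid out over several lines with a comment on each part. This is only a sketch: the sample file is a shortened excerpt of the question's data, and the names bnf, out and @chunks are made up for the illustration.

```shell
# Sample input (an assumption: a shortened excerpt of the question's data).
bnf=$(mktemp)
cat > "$bnf" <<'EOF'
apostrophe ::=
  "'"
/*
5.2     token and separator

Format
*/
token ::=
    nondelimiter_token
  | delimiter_token

large_object_length_token ::=
  digit+ multiplier
EOF

# Same pattern as the one-liner, spread out thanks to /x.
out=$(perl -0777 -wnE '
    my @chunks = m{
        ( .*? ::= .*? )         # one definition, matched non-greedily across lines (/s)
        \n                      # the newline ending its last line
        (?: \n+                 # followed by an empty line,
          | (?: /\* .*? \*/ )   # or by a comment, which is consumed and dropped,
          | \z                  # or by the end of the input
        )
    }gsx;
    s/\A\s+// for @chunks;      # drop the newline left over right after a comment
    say "$_\n^^\n" for @chunks;
' "$bnf")
printf '%s\n' "$out"
rm -f "$bnf"
```

    The extra s/\A\s+// step is not in the one-liner; it is added here because a chunk that starts right after a comment would otherwise begin with a stray newline.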

    Or, first remove the comments and then split the input string on runs of more than one newline:

    perl -0777 -wnE's{ (?: /\* .*? \*/ ) }{\n}gsx; say for split /\n\n+/;' bnf.txt
    

    I don't see a need for lookaheads.
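
    If a lookahead is wanted anyway, one possible sketch is to strip the comments first and then let each match run, non-greedily, up to a lookahead for the next "name ::=" or the end of input. In the question's attempts the leading (?=...) only repeats what the pattern consumes right after it, so it changes nothing: with /s the greedy .+ runs to the end of the input, and .+? stops after one character because nothing after it forces it to go on. A lookahead placed at the end of the pattern is what bounds a chunk. The sample input and the names bnf and out are assumptions for the illustration.

```shell
# Sample input (an assumption): a comment that itself contains ::= lines
# sits directly between two definitions, with no empty lines around it.
bnf=$(mktemp)
cat > "$bnf" <<'EOF'
identifier_part ::=
    identifier_start
  | identifier_extend
/*
identifier_start ::=
  "!! See the Syntax Rules."
*/
large_object_length_token ::=
  digit+ multiplier
EOF

out=$(perl -0777 -wnE '
    s{/\*.*?\*/}{\n}gs;      # remove comments first, so their ::= lines are not matched
    # A definition: name, ::=, then as little as possible up to the point
    # where (after optional whitespace) the next "name ::=" or the end follows.
    while ( /(\w+\s*::=.*?)\s*(?=\w+\s*::=|\z)/gs ) {
        say "$1\n^^\n";
    }
' "$bnf")
printf '%s\n' "$out"
rm -f "$bnf"
```

    The lookahead does not consume the next definition's name, so the following match starts exactly there.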


    The original version of this post used paragraph mode, via -00, or a regex that splits the whole input on multiple newlines.

    That was exceedingly simple and clean, at least with the input from the original version of the question, which had no comments. The comments that were added later may contain empty lines, so reading in paragraphs no longer works: an empty line inside a comment ends a paragraph, and spurious chunks would be introduced.

    I'm restoring it below since it's been deemed useful:

    If there's always an empty line separating the chunks of interest, then you can process the file in paragraph mode:

    perl -00 -wne'print' file
    

    This retains the empty line, which you appear to want to keep anyway. If not, it can be removed.

    (Curiously, you can then even do simply perl -00 -pe'1' file.)

    Otherwise, you can slurp the file and split that string on runs of more than one newline:

    perl -0777 -wnE'@chunks = split /\n\n+/; say for @chunks' file
    

    or, if you indeed need to just output them:

    perl -0777 -wnE'say for split /\n\n+/' file
    

    Empty lines between chunks are now removed.

    I don't see a reason to go for a lookahead.


    I realize that a "BNF definition" may include the line(s) after the one with ::=. In that case, one way is:

    perl -0777 -wnE'say for /(.+?::=.*?)\n(?:\n+|\z)/gs' file
    

    However, with possible comments (/* ... */) that need to be omitted:

    perl -0777 -wnE'say for m{(.*?::=.*?)\n (?: \n+ | (?:/\*.*?\*/) | \z)}gsx' bnf.txt
    

     


    A reminder: all revisions of a post can be seen via the link right under the post, the one showing the last-edit timestamp.