I'm using the grammar on this site in my JavaCC project. It works fine apart from some PICTURE strings, for example ----,---,---.99 or --9.
It doesn't seem to like more than one dash.
What do I need to change in this grammar to support my picture examples?
I've messed about with
void NumericConstant() :
{}
{
(<PLUSCHAR>|<MINUSCHAR>)? IntegerConstant() [ <DOTCHAR> IntegerConstant() ]
}
but nothing seems to work. Any help is much appreciated.
EDIT:
<COBOL_WORD: ((["0"-"9"])+ (<MINUSCHAR>)*)*
(["0"-"9"])* ["a"-"z"] ( ["a"-"z","0"-"9"] )*
( (<MINUSCHAR>)+ (["a"-"z","0"-"9"])+)*
>
Is this the regular expression that matches this whole line?
07 STRINGFIELD2 PIC AAAA.
If I want to accept
05 TEST3 REDEFINES TEST2 PIC X(10).
would I change the regex to this?
<COBOL_WORD: ((["0"-"9"])+ (<MINUSCHAR>)*)*
(<REDEFINES> (["0"-"9"])* ["a"-"z"] ( ["a"-"z","0"-"9"] )*)?
(["0"-"9"])* ["a"-"z"] ( ["a"-"z","0"-"9"] )*
( (<MINUSCHAR>)+ (["a"-"z","0"-"9"])+)*
>
Thanks a lot for the help so far.
Why are you messing around with NumericConstant() when you are trying to parse a COBOL PICTURE string?
According to the JavaCC source you have, a COBOL PICTURE should parse with:
void DataPictureClause() :
{}
{
( <PICTURE> | <PIC> ) [ <IS> ] PictureString()
}
The --9 bit is a Picture string and should parse with the PictureString() function:
void PictureString() :
{}
{
[ PictureCurrency() ]
( ( PictureChars() )+ [ <LPARENCHAR> IntegerConstant() <RPARENCHAR> ] )+
[ PicturePunctuation() ( ( PictureChars() )+ [ <LPARENCHAR> IntegerConstant() <RPARENCHAR> ] )+ ]
}
PictureCurrency() comes up empty, so move on to PictureChars():
void PictureChars() :
{}
{
<INTEGER> | <COBOL_WORD>
}
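But COBOL_WORD does not appear to support many "interesting" valid PICTURE clause definitions. It requires at least one letter and cannot start with a minus sign, so a string such as --9 never matches: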
<COBOL_WORD: ((["0"-"9"])+ (<MINUSCHAR>)*)*
(["0"-"9"])* ["a"-"z"] ( ["a"-"z","0"-"9"] )*
( (<MINUSCHAR>)+ (["a"-"z","0"-"9"])+)*
>
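To see what that token does and does not accept, here is a rough sketch that hand-translates COBOL_WORD into a java.util.regex pattern. It is only an approximation of what the generated lexer does: the class name CobolWordCheck and the method isCobolWord are made up for illustration, and I am assuming the grammar matches letters case-insensitively (as if IGNORE_CASE were set), since the token itself only lists lowercase letters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JavaCC grammar.
final class CobolWordCheck {
    // Hand translation of the COBOL_WORD token.
    // Assumption: case-insensitive matching, as if the grammar used IGNORE_CASE.
    private static final Pattern COBOL_WORD = Pattern.compile(
        "(?:[0-9]+-*)*"          // digit groups, each optionally followed by dashes
      + "[0-9]*[a-z][a-z0-9]*"   // at least one letter is mandatory
      + "(?:-+[a-z0-9]+)*",      // dash-separated trailing parts
        Pattern.CASE_INSENSITIVE);

    // True when the whole text would be a single COBOL_WORD.
    static boolean isCobolWord(String text) {
        return COBOL_WORD.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"STRINGFIELD2", "AAAA", "DISP-NBR-1", "--9", "----,---,---.99"};
        for (String s : samples) {
            System.out.println(s + " -> " + isCobolWord(s));
        }
    }
}
```

The first three samples are accepted, while --9 and ----,---,---.99 are rejected because they contain no letter and start with a dash. Note also that the token describes a single word such as STRINGFIELD2 or AAAA, not a whole source line.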
Parsing COBOL is not easy. In fact, it is probably one of the most difficult languages in existence to build a quality parser for. I can tell you right now that the JavaCC source you are working from is not going to cut it, except for some very simple and probably totally artificial COBOL program examples.
Answer to comment
COBOL Picture strings tend to mess up the best of parsers. The minus sign you are having trouble with is only the tip of the iceberg! Picture strings are difficult to parse because the period and comma may be part of a Picture string but serve as separators outside of the string. This means that parsers cannot unambiguously classify a period or comma in a context-free manner. They need to be "aware" of the context in which it is encountered. This may sound trivial, but it isn't.
Technically, a separator period or comma must be followed by a space (or end of line). This little fact could make determining the role of a period or comma very simple, because a Picture string cannot contain a space. However, many commercial COBOL compilers are "smart" enough to correctly recognize separator periods and commas that are not followed by a space. Consequently, a lot of COBOL programmers code illegal separator periods and commas, which means you will probably have to deal with them.
The bottom line is that no matter what you do, those little Picture strings are going to haunt you. They will take quite a bit of effort to deal with.
Just a hint of things to come: how would you parse the following?
01 DISP-NBR-1 PIC -99,999.
01 DISP-NBR-2 PIC -99,999..
01 DISP-NBR-3 PIC -99,999, .
01 DISP-NBR-4 PIC -99,999,.
The period following DISP-NBR-1 terminates the Picture string; it is a separator period. The first period following DISP-NBR-2 is part of the string, and the second period is the separator. The comma following DISP-NBR-3 is a separator, not part of the Picture string. However, the comma following DISP-NBR-4 is part of the Picture string because it is not followed by a space.
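Here is a minimal sketch of the strict rule only (a period or comma is a separator when it is followed by a space or end of line), applied to the four lines above. The class PictureScan and the method pictureString are hypothetical names of mine, not part of the JavaCC grammar. It assumes a single-line declaration that uses the exact keyword " PIC ", and it deliberately ignores the lenient behaviour of commercial compilers, PICTURE, IS, and continuation lines.

```java
// Hypothetical helper illustrating the strict separator rule.
final class PictureScan {
    // Returns the picture string that follows " PIC " on one source line,
    // or null when the line has no PIC keyword.
    static String pictureString(String line) {
        int start = line.indexOf(" PIC ");
        if (start < 0) {
            return null;
        }
        int i = start + 5;
        while (i < line.length() && line.charAt(i) == ' ') {
            i++; // skip extra blanks after the keyword
        }
        int begin = i;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == ' ') {
                break; // a picture string cannot contain a space
            }
            boolean spaceOrEndNext = i + 1 == line.length() || line.charAt(i + 1) == ' ';
            if ((c == '.' || c == ',') && spaceOrEndNext) {
                break; // separator period or comma, not part of the string
            }
            i++;
        }
        return line.substring(begin, i);
    }

    public static void main(String[] args) {
        String[] lines = {
            "01 DISP-NBR-1 PIC -99,999.",
            "01 DISP-NBR-2 PIC -99,999..",
            "01 DISP-NBR-3 PIC -99,999, .",
            "01 DISP-NBR-4 PIC -99,999,."
        };
        for (String line : lines) {
            System.out.println("[" + pictureString(line) + "]");
        }
    }
}
```

Under those assumptions the four lines yield -99,999 then -99,999. then -99,999 then -99,999, which matches the explanation above. A tokenizer would need this kind of lookahead, switched on only after PIC or PICTURE, before handing the string to the parser.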
Welcome to COBOL!