java regex string string-parsing

Split textual script into substrings by pattern


Consider the following script (it's total nonsense in a pseudo-language):

if (Request.hostMatch("asfasfasf.com") && someString.existsIn(new String[] {"brr", "hrr"}))   {
    if (Request.clientIp("10.0.x.x")) {
        somevar = "1";
    }
    somevar = "2";
}
else {
    somevar = "first";
}
string foo = "foo";
// etc. etc.

How would you grab the if-block's parameters and contents from it? The if-block has the format:

if<whitespace>(<parameters>)<whitespace>{<contents>}<anything>

I tried using String.split() with the regex pattern ^if\s*\(|\)\s*\{|\}\s* but this fails miserably. The problem is that ) { is also found in the inner if-block, and the closing } is found in many places as well. I don't think either lazy or greedy matching works here.

So... any pointers to what I might need in order to implement this with regex?

I also need to get the remaining string without the if-block's code (so code starting from else { ...). Using just String.split() seems to make it difficult as there is no information about the length of the parts that were parsed away.

I initially created a loop-based solution (using String.substring() heavily) for this, but it's dull. I would like to have something fancier instead. Should I go with regex, or create a custom, generic function (there are many other cases than just this one) that takes the String to parse and the pattern (consider the if<whitespace>(... pattern above)?

Edit: Changed returns to variable assignments, as it would not have made sense otherwise.


Solution

  • A plain regular expression won't work here, because a regular grammar can't match things like "some number of open parentheses followed by the same number of close parentheses". A context-free grammar is needed for that.

    Unless you use a context-free grammar parser for Java or a regular expression extension that makes regular expressions no longer regular, your loop-based solution is probably the fanciest solution.
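
To make the loop-based route concrete, here is a minimal sketch rather than a definitive implementation. It uses a regex only for the regular part of the header (if, optional whitespace, opening parenthesis) and a depth counter for the nested part. The names IfBlockParser, IfBlock, parse and findMatching are hypothetical, invented for this illustration. It assumes the script starts with the if-block, that string literals use double quotes with backslash escapes, and that no comments or character literals contain stray delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IfBlockParser {

    /** Hypothetical holder for the three pieces the question asks for. */
    static final class IfBlock {
        final String parameters;
        final String contents;
        final String rest;

        IfBlock(String parameters, String contents, String rest) {
            this.parameters = parameters;
            this.contents = contents;
            this.rest = rest;
        }
    }

    // Only the regular part of the header: optional whitespace, "if", optional whitespace, "(".
    private static final Pattern IF_HEAD = Pattern.compile("\\s*if\\s*\\(");

    static IfBlock parse(String script) {
        Matcher head = IF_HEAD.matcher(script);
        if (!head.lookingAt()) {
            throw new IllegalArgumentException("script does not start with an if-block");
        }
        int openParen = head.end() - 1; // Matcher.end() tells us how much was consumed
        int closeParen = findMatching(script, openParen, '(', ')');

        // Skip the whitespace between ")" and "{".
        int openBrace = closeParen + 1;
        while (openBrace < script.length() && Character.isWhitespace(script.charAt(openBrace))) {
            openBrace++;
        }
        if (openBrace >= script.length() || script.charAt(openBrace) != '{') {
            throw new IllegalArgumentException("expected '{' after the if condition");
        }
        int closeBrace = findMatching(script, openBrace, '{', '}');

        return new IfBlock(
                script.substring(openParen + 1, closeParen),
                script.substring(openBrace + 1, closeBrace),
                script.substring(closeBrace + 1));
    }

    /** Returns the index of the delimiter that closes the one at openIdx. */
    static int findMatching(String s, int openIdx, char open, char close) {
        int depth = 0;
        boolean inString = false;
        for (int i = openIdx; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true; // delimiters inside string literals are ignored
            } else if (c == open) {
                depth++;
            } else if (c == close && --depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("unbalanced '" + open + "'");
    }

    public static void main(String[] args) {
        IfBlock block = parse("if (a.check(\"x\")) { if (b) { c = \"1\"; } d = \"2\"; } else { e = \"3\"; }");
        System.out.println("parameters: " + block.parameters);
        System.out.println("contents:   " + block.contents);
        System.out.println("rest:       " + block.rest);
    }
}
```

Because parse also returns the rest of the input, the same call can be repeated on the remainder to handle the else branch or any following statements. The helper findMatching is generic over the delimiter pair, so it could be reused for the other cases mentioned in the question.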