java, regex, algorithm, regex-group, musicxml

Regex code not collecting multiple lines of matching pattern


I'm new to using regex and I was hoping that someone could help me with this.

I have a regex that is supposed to identify tab groups in a tablature file. It works on regex testing websites such as regexr.com, regextester.com, and extendsclass.com/regex-tester, but when I run it in Java on the example text shown below, I am given each individual line as its own separate group, instead of 4 groups that each contain a block of text whose lines are separated only by a single newline. I have read through the Stack Overflow thread "Regular expression works on regex101.com, but not on prod" and have been careful to avoid string literal problems and multiline problems. I've also tried the regex with other regex engines on regex101, where it worked, but it still does not work in my Java code shown below.

I tried enabling the multiline flag, but it still doesn't work. I thought it was a problem with my code, but then I got the same wrong output on two other regex tester websites: myregexp.com and freeformatter.com/java-regex-tester.

Here is the original regex. It is long, so it might be easier to use the simplified regex given after it, as they both have the same problem:

RealRegexCode = (^|[\n\r])(((?<=^|[\n\r])[^\S\n\r]*\|*[^\S\n\r]*((E|A|D|G|B|e|a|d|g|b)[^\S\n\r]*\|*(?=(([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))[|\r\n]|$)))((([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))\|)+(((?<=\|)[^\S\n\r]*((E|A|D|G|B|e|a|d|g|b)[^\S\n\r]*\|*(?=(([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))[|\r\n]|$)))((([^\S\n\r]*-[ -]*(?=\|))|([ -]*((\(?[a-zB-Z0-9]+\)?)+[^\S\n\r]*-[ -]*)+((\(?[a-zB-Z0-9]+\)?)+){0,1}[^\S\n\r]*))\|)+)*(\n|\r|$))+

Here is a simplified regex that displays the same problem, provided for the sake of debugging:

SimplifiedRegexCode = (^|[\n\r])([^\n\r]+(\n|\r|$))+

Here is the code that finds the matches using the regex pattern:

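import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    // Every backslash in a Java string literal must be doubled ("\p" is not a valid escape).
    String filePath = "C:\\Users\\stani\\IdeaProjects\\project\\src\\testing files\\guitar - a thousand matches by passenger.txt";
    Path path = Path.of(filePath);
    List<String> stuff = new ArrayList<>();
    try {
        // Read the whole file into one string.
        String rootStr = Files.readString(path);
        Pattern pattern = Pattern.compile("(^|[\\n\\r])([^\\n\\r]+(\\n|\\r|$))+");
        Matcher ptrnMatcher = pattern.matcher(rootStr);
        // Collect the full text of every match.
        while (ptrnMatcher.find()) {
            stuff.add(ptrnMatcher.group());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Patterns is a class from my project that is not shown here.
    System.out.println(new Patterns().MeasureGroupCollection);
    for (String s : stuff)
        System.out.println(s);
}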

And here is the text I was testing it with. It might help to copy and paste this into a text editor, as Stack Overflow might distort how the text looks:

e|---------------------------------|------------------------------------|
e|------------------------------------------------------------------|
B|-----1--------(1)----1-----------|-------1---------------1----------1-|
B|-----1--------(1)----0---------0-----1---------1-----3--------(3)-|
G|-----------0------------0--------|-------------0----------------0-----|
G|-----------0---------------0---------------0---------------0------|
D|-----0h2-----2-------2-----------|-------2-------2-------0--------0---|
D|-----2-------2-------2-------2-------2-------2-------0-------0----|
A|-3-------3-------3-------3-------|------------------------------------|
A|-0-------0--------------------------------------------------------|
E|-----------------------------0---|---1-------1-------3-------3--------|
E|-----------------0-------0--------1------1-------3-------3--------|


e|-------------------------------------------------------------------|
B|-----1---------1-----1---------1-----3---------3-------1---------1-|
G|-----------0---------------0---------------0-----------------0-----|
D|-----3-------2-------2-------2-------0-------0---------2-------2---|
A|-----------------3-------3-------------------------3-------3-------|
E|-1-------1-----------------------3-------3-------------------------|

It should identify four different groups from the text. However, in Java and in the two testers I mentioned above, it recognizes each line as its own separate group (i.e., 12 groups).


Solution

  • I couldn't help but respond to this as I am familiar with both regex and guitar haha.

    For your short regex, please see the following regex on regex101.com: https://regex101.com/r/NqGhoh/1/

    The multiline modifier is required.
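
    To illustrate what the flag changes, here is a minimal sketch. The class name MultilineFlagDemo, the helper countLineStarts, and the three-line sample text are made up for this illustration; only Pattern.MULTILINE and the Matcher calls come from java.util.regex. Without the flag, ^ matches only at the very start of the input; with it, ^ also matches right after each line terminator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineFlagDemo {
    // Counts the non-empty runs of characters that begin at a position where ^ matches.
    public static int countLineStarts(String text, int flags) {
        Matcher m = Pattern.compile("^[^\\r\\n]+", flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three short tab lines separated by \n.
        String text = "e|---|\nB|---|\nG|---|";
        System.out.println(countLineStarts(text, 0));                 // prints 1
        System.out.println(countLineStarts(text, Pattern.MULTILINE)); // prints 3
    }
}
```

    The same effect can be had by putting (?m) at the front of the pattern instead of passing the flag to Pattern.compile.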

    The main problem is that you are handling newlines at both the front and the back of the expression. I have modified the expression in a few ways:

    • Made the regex match newlines only at the end, always looking for a ^ at the beginning.
    • Matched the carriage return and newline combination as \r?\n, since a carriage return, when present, should always be followed by a newline.
    • Used non-capturing groups to reduce overhead and complexity when looking at matches. This is the ?: just inside the opening parenthesis. It means the group won't be captured in the result; it is only used for grouping.
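
    The three changes above can be sketched in Java roughly as follows. This is not necessarily the exact pattern behind the regex101 link; FIXED is a reconstruction from the description, and the class name TabGroupDemo, the helper findGroups, and the sample text are made up for the illustration. The sample also assumes the file uses Windows-style \r\n line endings, which the question does not confirm. Under that assumption the simplified pattern from the question stops after every \r, so each line comes back on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TabGroupDemo {
    // The simplified pattern from the question, compiled without flags.
    public static final Pattern ORIGINAL =
            Pattern.compile("(^|[\\n\\r])([^\\n\\r]+(\\n|\\r|$))+");

    // Assumed reconstruction of the fix: anchor with ^ only, consume the line
    // break at the end as \r?\n, and use non-capturing groups throughout.
    public static final Pattern FIXED =
            Pattern.compile("^(?:[^\\r\\n]+(?:\\r?\\n|$))+", Pattern.MULTILINE);

    // Returns the full text of every match of the pattern in the input.
    public static List<String> findGroups(Pattern pattern, String text) {
        List<String> groups = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two blocks of two lines, CRLF line endings,
        // separated by one empty line.
        String text = "e|--0--|\r\nB|--1--|\r\n\r\ne|--3--|\r\nB|--0--|";
        System.out.println(findGroups(ORIGINAL, text).size()); // prints 4
        System.out.println(findGroups(FIXED, text).size());    // prints 2
    }
}
```

    Each match of FIXED still includes its trailing line breaks, so call strip() on a match if only the tab lines themselves are wanted.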

    I started testing your longer regex and may update that as well, though it sounds like you will know what to do with it now that the shorter one is corrected.