
Java: lookbehind for splitting by greedy quantifier expressions


I wrote the following expression to split a string after every x words (3, for instance), each followed by a whitespace character. My problem is that I need to keep the entire content, but I cannot find a way to use lookbehind or a similar construct to accomplish this in Java.

Does anyone have experience with that?

int ngramm_length = 3; // number of words per token
String text = "Hello my name is Tom and i love playing football";
String regex = "([a-zA-Z0-9öÖäÄüÜß]+\\s){" + ngramm_length + "}";
System.out.println(regex);
String[] ngramms = text.split(regex);

The result is 4 tokens, but only the last one still contains any content (the first three are empty strings). I would like to get:

1: Hello my name
2: is Tom and
3: i love playing
4: football
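To illustrate why the words disappear: `String.split` treats every regex match as a delimiter and drops it, so the three 3-word matches are discarded and only the text between them (empty strings) and after them is returned. The sketch below reproduces this; the class and method names are made up for illustration, and `[\p{L}\p{N}]` is used as an assumed stand-in for the explicit character class above.

```java
import java.util.Arrays;

public class SplitDropsMatches {
    // Splits text on runs of n whitespace-terminated words, as in the question.
    static String[] splitTokens(String text, int n) {
        // Assumption: \p{L}\p{N} stands in for the explicit class [a-zA-Z0-9öÖäÄüÜß].
        String regex = "([\\p{L}\\p{N}]+\\s){" + n + "}";
        // split() removes whatever the regex matches and keeps only the text in between.
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "Hello my name is Tom and i love playing football";
        String[] tokens = splitTokens(text, 3);
        System.out.println(tokens.length);
        System.out.println(Arrays.toString(tokens));
    }
}
```

The array holds three empty strings followed by "football", which is why iterating over the matches with `Matcher.find()` is a better fit here than `split`.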

Java code with the quantifier built dynamically:

public static void main(String[] args) {
    int length = 3; // 2
    // Builds the comma-separated list 1,...,length-1 (e.g. "1,2" for length 3)
    String dynamic_length = "";
    for (int i = 1; i < length; i++) {
        dynamic_length += i;

        if (i + 1 < length) {
            dynamic_length += ",";
        }
    }

    final String regex = "([a-zA-Z0-9öÖäÄüÜß]+\\s){" + length + "}|([a-zA-Z0-9öÖäÄüÜß]+\\s){" + dynamic_length + "}";
    final String string = "Hello my name is Tom and i love playing football\n\n";

    final Pattern pattern = Pattern.compile(regex);
    final Matcher matcher = pattern.matcher(string);
    int count = 0;
    // Print every match instead of splitting, so the matched words are kept
    while (matcher.find()) {
        ++count;
        System.out.println("match:" + count + " " + matcher.group(0));
    }
}

This is not dynamic: it only works for a length of 2 or 3, because for larger values the loop produces a list such as 1,2,3, which is not a valid {min,max} quantifier. Is that the problem, or am I missing something?

For x > 1 I can use:

final String regex = "([a-zA-Z0-9öÖäÄüÜß]+\\s){" + length + "}|([a-zA-Z0-9öÖäÄüÜß]+\\s){1," + (length - 1) + "}";

For x = 1 I can use:

final String regex = "([a-zA-Z0-9öÖäÄüÜß]+\\s){" + length + "}|([a-zA-Z0-9öÖäÄüÜß]+\\s){1}";

or just split on whitespace.
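The two cases above can be folded into one helper, sketched here under a few assumptions: the names `NgramRegex`, `buildRegex` and `chunks` are hypothetical, and `[\p{L}\p{N}]` again stands in for the explicit character class. The n = 1 branch is needed because `{1,0}` is not a legal quantifier in Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NgramRegex {
    // Assumption: stand-in for the explicit class [a-zA-Z0-9öÖäÄüÜß].
    private static final String WORD = "[\\p{L}\\p{N}]+";

    // Builds (word\s){n}|(word\s){1,n-1} for n > 1 and (word\s){1} for n == 1.
    static String buildRegex(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be >= 1");
        }
        String group = "(" + WORD + "\\s)";
        if (length == 1) {
            return group + "{1}";
        }
        return group + "{" + length + "}|" + group + "{1," + (length - 1) + "}";
    }

    // Collects every match, so the matched words are kept rather than discarded.
    static List<String> chunks(String text, int length) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(buildRegex(length)).matcher(text);
        while (m.find()) {
            result.add(m.group(0));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello my name is Tom and i love playing football\n\n";
        for (int n = 1; n <= 4; n++) {
            System.out.println(n + ": " + chunks(text, n));
        }
    }
}
```

Note that each word must be followed by a whitespace character, so the last word is only matched because the sample string ends with newlines, and every chunk keeps its trailing whitespace.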

Thanks to Maverick_Mrt!


Solution

  • You can try this:

    ([a-zA-Z0-9öÖäÄüÜß]+\s){3}|([a-zA-Z0-9öÖäÄüÜß]+\s){1,2}
    

    Explanation

    Look at the match information box in the link.

    Java code:

    public static void main(String[] args) {
        final String regex = "([a-zA-Z0-9öÖäÄüÜß]+\\s){3}|([a-zA-Z0-9öÖäÄüÜß]+\\s){1,2}";
        final String string = "Hello my name is Tom and i love playing football\n\n";
    
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        int count = 0;
        // Each find() returns the next block of up to three words
        while (matcher.find()) {
            ++count;
            System.out.println("match:" + count + " " + matcher.group(0));
        }
    }
    

    As per your comment:

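    If you want n words per match, use the pattern below. Make sure n > 1, because for n = 1 the second quantifier would become {1,0}, which Java rejects as an illegal repetition range.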

    ([a-zA-Z0-9öÖäÄüÜß]+\s){n}|([a-zA-Z0-9öÖäÄüÜß]+\s){1,n-1}
    
    
    Sample output
    
        match:1 Hello my name 
        match:2 is Tom and 
        match:3 i love playing 
        match:4 football
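The pattern above requires whitespace after every word, so a final word at the very end of the string would be missed if the input had no trailing newline. A possible variant, not part of the answer above, matches one word followed by up to n-1 further words, each preceded by whitespace. The names `WordChunker` and `chunk` are hypothetical, and treating any run of Unicode letters or digits as a word is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordChunker {
    // Assumption: any run of Unicode letters or digits counts as a word.
    private static final String WORD = "[\\p{L}\\p{N}]+";

    static List<String> chunk(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // One word, then up to n-1 further words, each preceded by whitespace.
        Pattern p = Pattern.compile(WORD + "(?:\\s+" + WORD + "){0," + (n - 1) + "}");
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Hello my name is Tom and i love playing football";
        List<String> chunks = chunk(text, 3);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println((i + 1) + ": " + chunks.get(i));
        }
    }
}
```

Because the greedy quantifier already takes as many words as it can, a single `{0,n-1}` range covers both the full blocks and the shorter remainder, and the chunks come back without trailing whitespace.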