regex julia

Match repetitions of 1,2,3,4 of length `n` that end with at least `m` 5s or `m` 6s, except for the first and last match


I have a string which consists only of the symbols 1, 2, 3, 4, 5, 6. I will present the problem "in layers" because it is quite complex, at least for me (I don't know my way around regexes).

I want to find those portions of the string that contain repetitions of 2, 3, 4 or 1 of length n or more, but the character "1" should not contribute to the length count.

I also want the pattern to be found iff it ends with at least m occurrences of character five or m occurrences of character 6. For example, if n=10, m=10, consider the following string (where the spaces are only for clarity, the string itself will contain no spaces):

A = "66666666611166111 23423412342341133355555555555 2342345"

The pattern in the middle should be matched, because it is a repetition of 2s, 3s and 4s of length > 10 (ignoring the 1s, which don't count), and it ends with 11 >= 10 5s.

A more complex example, with several matches (shown by the spaces):

A = "66666666611166111 23423412342341133355555555555 2342345515 2341123423423423423466666666666666"

Importantly, we want to match repetitions of 2s, 3s and 4s of length n (ignoring 1s) at the end of the string as well (that is, ending not with 5s or 6s but with the empty string). For example, in

A = "66666666611166111 23423412342341133355555555555 2342345515 23411234234234234234"

we should match the second and the fourth pattern.

As a last condition, only for the first and the last matches, we must impose n=1 and m=1. That is, for the first and last matches, it doesn't matter how many 5s or 6s the pattern ends with; we should still match it. For example, in

A = "66666666611166111 2342341234234113336 62342345515 2341123423423423423455555555555555 212315551235555 222222222222266"

we should match the second segment, because it is the first repetition of 2,3,4,1 of the desired length, regardless of the fact that it ends with only a single 6. We should also match the last segment, because it is the last repetition of 2,3,4,1 of the desired length, regardless of the fact that it ends with only two 6s. The fourth pattern should also be matched.

How can such a complex regex be built? Answers in Julia code are particularly appreciated.


Solution

  • I want to find those portions of the string that contain repetitions of 2, 3, 4 or 1 of length n or more, but the character "1" should not contribute to the length count.

    A regex of \b1*(?:[2345]1*){n,}\b, where n is replaced by the wanted value. The \b at the start and end is a word boundary which, in these examples, falls between spaces and digits. The two occurrences of 1* allow any number of 1 digits without them being counted. The {n,} says there should be n or more occurrences of the preceding group, which matches one of the digits 2, 3, 4 or 5 followed by zero or more 1s. Because 5 is in the character class, 5s also count toward n here. The (?:...) is a non-capturing group.
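Since the question asks for Julia, here is a minimal sketch of how the regex above could be built for a given n and tried on the first example. The helper name count_regex is hypothetical (it is not part of the answer), and the spaces from the question's example are kept on purpose, because \b only finds a boundary where a digit meets a non-digit.

```julia
# Hypothetical helper (not from the answer): build the pattern for a given n.
# "\\b" is an escaped backslash; the dollar sign interpolates the number.
count_regex(n) = Regex("\\b1*(?:[2345]1*){$n,}\\b")

# First example from the question, spaces kept so that \b has boundaries.
A = "66666666611166111 23423412342341133355555555555 2342345"

# Collect the text of every match.
found = [hit.match for hit in eachmatch(count_regex(10), A)]
println(found)
```

With n = 10 this should pick out only the middle segment: the first segment starts with a 6, and the last one has too few counted digits.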

    I also want the pattern to be found iff it ends with at least m occurrences of character five or m occurrences of character 6. For example, if n=10, m=10, consider the following string (where the spaces are only for clarity, the string itself will contain no spaces):

    Extending the above regex for the trailing 5s or 6s gives:

    \b1*(?:[234]1*){n,}(?:5{m,}|6{m,})\b, again replacing n and m with numbers. Note that the question uses n for two different counts.
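The extended pattern can be sketched in Julia along the same lines, assuming both the trailing 5s and the trailing 6s use the count m as the question's title states. The helper name tail_regex is hypothetical, and the example string again keeps the question's spaces so that \b can match.

```julia
# Hypothetical helper (not from the answer): build the pattern for given n and m.
# Assumption: both trailing alternatives use m, as in the question's title.
tail_regex(n, m) = Regex("\\b1*(?:[234]1*){$n,}(?:5{$m,}|6{$m,})\\b")

# Second example from the question, spaces kept so that \b has boundaries.
A = "66666666611166111 23423412342341133355555555555 2342345515 2341123423423423423466666666666666"

# Collect the matched text; hit.offset would give the 1-based start index.
found2 = [hit.match for hit in eachmatch(tail_regex(10, 10), A)]
foreach(println, found2)
```

With n = 10 and m = 10 this should report the second and fourth segments. It does not cover the question's extra cases (a run at the very end of the string with no trailing 5s or 6s, or the relaxed first and last matches).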

    As a last condition, only for the first and the last matches, ...

    What are the first and last matches? This section of the requirement is not clear to me.