
Find all substrings matching a complex pattern


Let alpha be a string of randomly sampled elements from the set {1, 2, 3, 4, 5, 6}. For example, alpha could be "1132345216".

Assume alpha is long enough to contain at least four sub-strings satisfying the following:

  • The substring starts with a run of N characters drawn from 2, 3 and 4, in any order; each digit may repeat or be absent. For example, the substring may begin 2222... or 234234234....
  • The substring ends with one of the following two patterns: a single 6, or a run of more than M consecutive 5s.

For example, "2346" and "234555" would satisfy these properties for N = 3 and M = 2.

I want to find all substrings of this kind in alpha using Julia. Regular expressions are the obvious tool. I am somewhat versed in regular expressions from the perspective of formal language theory, but I have never used them in a programming language, where they behave differently, and so far I have failed to get the desired result.

A quick sample code for anyone who cares to try this:

pattern_string = r"..." # What's the right regex???

# Test string to search for matches
test_string = "1111122211111 2323232234233246 5161532161 232342342322224444223323555555"

# Find all matches in the test string
matches = eachmatch(pattern_string, test_string)
# Output the matches found
println("Matches found:")
for match in matches
    println(match.match)
end

In the example, I added spaces for visual clarity. The first substring (before the first space) should NOT be a match. The second should be a match for a small N. The third should not be a match. The last should be a match if M is less than 6.


Solution

  • Assuming you have n and m variables defined, you can create a regex using an interpolated string:

    n=10
    m=5
    
    pattern_string = Regex("[234]{$n}[1-6]*?(?:6|5{$(m+1),})")
    

    For these values, this gives a pattern string of

    [234]{10}[1-6]*?(?:6|5{6,})
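
The `r"..."` literal from the question does not interpolate variables (inside it, `$` is the end-of-input anchor), which is why the pattern is built with `Regex(...)` from an ordinary string. A minimal sketch for inspecting what gets built, assuming n = 10 and m = 5 (values picked here so that the result equals the pattern shown above):

```julia
n = 10
m = 5   # assumed value; "more than m" fives becomes the quantifier {m+1,}

# Ordinary string interpolation runs first, then the result is compiled as a regex
pattern = Regex("[234]{$n}[1-6]*?(?:6|5{$(m+1),})")

# The `pattern` field holds the source string of the compiled regex
println(pattern.pattern)
```

Printing the source string this way is a cheap check that the braces and the interpolated numbers ended up where intended.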
    

    This matches:

    • [234]{10} : exactly 10 characters, each a 2, 3 or 4, in any order
    • [1-6]*? : as few characters from 1-6 as possible (lazy quantifier)
    • (?:6|5{6,}) : either a single 6, or six or more 5s
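
To see why the middle part is lazy, here is a small sketch comparing `*?` with the greedy `*`. The input string and the small values N = 3, M = 2 are made up for illustration and do not come from the question:

```julia
s = "2226116"   # hypothetical input containing two 6s

lazy   = r"[234]{3}[1-6]*?(?:6|5{3,})"   # stop at the first possible ending
greedy = r"[234]{3}[1-6]*(?:6|5{3,})"    # run on to the last possible ending

lazy_hit   = match(lazy, s).match
greedy_hit = match(greedy, s).match

println(lazy_hit)
println(greedy_hit)
```

The lazy version ends at the first 6, while the greedy version swallows it and ends at the last one, which would merge what are arguably separate substrings.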

    For your sample data, this matches 2323232234233246 and 232342342322224444223323555555.
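
Putting the pieces together, a self-contained sketch that runs the pattern on the test string from the question. The name `found` and the value m = 5 are assumptions made for this example:

```julia
n = 10
m = 5   # assumed value, giving the quantifier 5{6,}

pattern = Regex("[234]{$n}[1-6]*?(?:6|5{$(m+1),})")

test_string = "1111122211111 2323232234233246 5161532161 232342342322224444223323555555"

# eachmatch yields RegexMatch objects; the `match` field is the matched text
found = [hit.match for hit in eachmatch(pattern, test_string)]

println("Matches found:")
foreach(println, found)
```

Note that `eachmatch` returns non-overlapping matches by default, scanning left to right.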


    If I've misinterpreted your question and the substrings may not contain a 1, or a 5 or 6 anywhere except at the end, you can change the regex to:

    pattern_string = Regex("[234]{$n,}(?:6|5{$(m+1),})")
    

    This simply matches a run of n or more characters from 2, 3 and 4, followed by either a 6 or at least m+1 5s.

    For your sample data this matches the same substrings.
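
The two variants agree on the sample data but not in general. A short sketch of an input where they differ, using made-up small values (n = 3, m = 2) and a hypothetical string that has 1s between the 2/3/4 prefix and the closing 6:

```julia
n = 3
m = 2   # hypothetical small values for illustration

permissive = Regex("[234]{$n}[1-6]*?(?:6|5{$(m+1),})")   # any of 1-6 allowed in the middle
strict     = Regex("[234]{$n,}(?:6|5{$(m+1),})")          # only 2, 3, 4 before the ending

s = "2341116"

permissive_hit = match(permissive, s)   # RegexMatch, or nothing if there is no match
strict_hit     = match(strict, s)

println(permissive_hit === nothing ? "no match" : permissive_hit.match)
println(strict_hit === nothing ? "no match" : strict_hit.match)
```

Which variant is right depends on whether other digits may appear between the prefix and the ending, which the question leaves open.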
