Search code examples
pythonregexnumpyregex-group

Using Regex to match graph intervals


Say I have an array of numbers: arr=[1, 2, 3, 4, 5, 4, 3, 2, 1].

Regular Regex has its basic units/literals as characters. I am wondering if there is a way to have "objects" as the basic unit, for an object [a,b].

Now I want to answer the question, "does arr contain at least 2 intervals of length 3 that have positive gradients?". I was thinking that I could mimic Regex by replacing characters with [a,b] objects, where a is the size of the interval and b is the function to run on that interval (gradient_pos). the function would return True or False depending on whether or not the condition was met. For example, to answer the above question, you would query:

[3,gradient_pos]{2,}

You will see that the syntax follows all the regex rules, except that rather than looking at characters, we are looking at [a,b] objects that effectively evaluate to True or False. The idea is that regex would try and match 2 or more intervals of length 3 that satisfy gradient_pos. Does that make sense? I know this is very abstract, but if anyone could help me out with achieving something like this, I'd greatly appreciate it!

Thanks!


Solution

  • You can treat it like a regex problem. So your array would become:

    import numpy as np
    arr=[1, 2, 3, 4, 5, 4, 3, 2, 1]
    arr = np.array(arr)
    arr_as_str=''.join(map(str,np.sign(arr[1:]-arr[:-1])+1))
    

    I got the gradients excluding the last element and shifted the sign function from (-1,+1) to (0,2). The reason for that is to avoid the minus signs in the string representation.

    So now you can just apply regex matching as usual where:

    • negative gradient: 0
    • zero gradient: 1
    • positive gradient: 2

    Your "3 consecutive positive gradients" query would look like this:

    matches=list(re.finditer('(?=(2{3}))',arr_as_str)) 
    

    Since there are overlapping regex matches, you have to use lookaheads like in: Python regex find all overlapping matches?

    You can refer back to the original array using the starting indices plus you know how long the sequences are:

    matches[1].start()
    

    I suggested this approach because it seems like you're comfortable with regular expressions. However, if you're interested in exploring other approaches I would recommend looking at morphological filtering, it would have been my first choice