Working out what was matched by each part of a regex

I have a few hundred (fairly simple) regular expressions and their matches within a large number of sequences. I would like to be able to tell what part of each regex matches what position in the target sequences. For example, the following regex “[DSTE][^P][^DEWHFYC]D[GSAN]” may be matched by positions 4 to 8 in the following sequence:

ABCSGADAZZZ

What I would like to get out (programmatically) is, for each regex, 1) each 'part' of the regex and 2) the position in the target sequence that was matched by it:

[DSTE] -- (3, 4),
[^P] -- (4, 5),
[^DEWHFYC] -- (5, 6),
D -- (6, 7),
[GSAN] -- (7, 8)

I found this website which essentially does what I want: https://regex101.com/, but I’m not sure how deep into regex parsing I would have to dive to be able to do this in my own code (I'm using Python and R).

Solution

It's still not 100% but I returned outputs on 3365/3510 of my dataset. The few I checked lined up :)

Included in my github (Linked below) are the inputs and outputs in csv, txt (respectively).

Please ignore the global variables; I was considering switching the code over to see if there was a noticeable improvement with the speed but didn't get around it.

Currently this version is having trouble with order of operations regarding the alternation and the start/end line operators (^ $) if they are options for alternation at the beginning or end of the string. I'm pretty confident it has to do with precedent; but I wasn't able to get it organized well enough.

The function call for the code is in the last cell. Instead of running the entire DataFrame with

for x in range(len(df)):

try:
    df_expression = df.iloc[x, 2]
    df_subsequence = df.iloc[x, 1]

    # call function
    identify_submatches(df_expression, df_subsequence)
    print(dataframe_counting)
    dataframe_counting += 1
except:
    pass

You can easily test one at a time by passing a pattern and a corresponding sequence to the function like this:

p = ''
s = ''

identify_submatches(p, s)

Code: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/motif_matching_02.ipynb

Inputs: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/elm_compiled_ss_re.csv

Outputs: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/motif_matching_02_outputs.txt

"""exp_a as input expression
       sub_a as input subject string"""

    input_exp = exp_a
    input_sub = sub_a

    m_gro = '\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = '\^*\[.+?\]({.+?})*\$*'
    m_alt = '\|'
    m_lit = '\^*[.\w]({.+?})*\$*|\$'


    # PRINTOUT
    if (print_type == 1):
        print('\nExpression Input: {}\nSequence Input: {}'.format(exp_a, sub_a))

    if (print_type == 3):
        print('\n\nSTART ITERATION\nINPUTS\n  exp: {}\n  seq: {}'.format(exp_a, sub_a))


    # return the pattern match (USE IF SUB IS NOT MATCHED PRIMARY)
    if r.search(exp_a, sub_a) is not None:
        m = r.search(exp_a, sub_a)
        sub_a = m.group()
        # >>>PRINTOUT<<<
        if print_type == 3:
            print('\nSEQUENCE TYPE M\n  exp: {}\n  seq: {}'.format(exp_a, sub_a))

        elif m is None:
            print('Search expression: {} in sequence: {} returned no matches.\n\n'.format(exp_a, sub_a))
            return None

    if (print_type == 1):
        print('Subequence Match: {}'.format(sub_a))



    # check if main expression has unnested alternation
    if len(alt_states(exp_a)) > 0:
        # returns matching alternative
        exp_a = alt_evaluation(exp_a, sub_a)

        # >>>PRINTOUT<<<
        if print_type == 3:
            print('\nALTERNATION RETURN\n  exp: {}\n  seq: {}'.format(exp_a, sub_a))


    # get initial expression list
    exp_list = get_states(exp_a)


    # count possible expression constructions
    status, matched_tuples = finite_state(exp_list, sub_a)

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nCONFIRM EXPRESSION\n  exp: {}'.format(matched_tuples))


    # index matches
    indexer(input_exp, input_sub, matched_tuples)


def indexer(exp_a, sub_a, matched_tuples):

    sub_length = len(sub_a)
    sub_b = r.search(exp_a, sub_a)
    adj = sub_b.start()
    sub_b = sub_b.group()

    print('')

    for pair in matched_tuples:
        pattern, match = pair

        start = adj
        adj = adj + len(match)
        end = adj
        index_pos = (start, end)

        sub_b = slice_string(match, sub_b)
        print('\t{}\t{}'.format(pattern, index_pos))

def strip_nest(s):

    s = s[1:]
    s = s[:-1]

    return s

def slice_string(p, s):

    pat = p
    string = s

    # handles escapes
    p = r.escape(p)
    # slice the input string on input pattern
    s = r.split(pattern = p, string = s, maxsplit = 1)[1]


    # >>>PRINTOUT<<<
    if print_type == 4:
        print('\nSLICE STRING\n  pat: {}\n  str: {}\n  slice: {}'.format(pat, string, s))


    return s

def alt_states(exp):
    # check each character in string
    idx = 0 # index tracker
    op = 0 # open parenth
    cp = 0 # close parenth
    free_alt = [] # amend with index position of unnested alt

    for c in exp:
        if c == '(':
            op += 1
        elif c == ')':
            cp += 1
        elif c == '|':
            if op == cp:
                free_alt.append(idx)
        if idx < len(exp)-1:
            idx+=1

    # split string if found
    alts = []

    if free_alt:
        _ = 0
        for i in free_alt:
            alts.append(exp[_:i])
            alts.append(exp[i+1:])

    # the truth value of this check can be checked against the length of the return
    # len(free_alt) > 0 means unnested "|" found
    return alts


def alt_evaluation(exp, sub):

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nALTERNATION SELECTION\n  EXP: {}\n  SEQ: {}'.format(exp, sub))

    # gets alt index position
    alts = alt_states(exp)

    # variables for eval
    a_len = 0 # length of alternate match
    keep_len = 0 # length of return match
    keep = '' # return match string

    # evaluate alternatives
    for alt in alts:
        m = r.search(alt, sub)
        if m is not None:
            a_len = len(m.group())                             # length of match string

            # >>>PRINTOUT<<<
            if print_type == 3:
                print('  pat: {}\n  str: {}\n  len: {}'.format(alt, m.group(0), len(m.group(0))))

            if a_len >= keep_len:                              
                keep_len = a_len                               # sets alternate length to keep length
                exp = alt                                     # sets alt as keep variable

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('  OUT: {}'.format(exp))                

    return exp

def get_states(exp):
    """counts number of subexpressions to be checked
       creates FSM"""

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nGET STATES\n  EXP: {}'.format(exp))

    # List of possible subexpression regex matches
    m_gro = '\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = '\^*\[.+?\]({.+?})*\$*'
    m_alt = '\|'
    m_lit = '\^*[.\w]({.+?})*\$*|\$'


    # initialize capture list
    exp_list = []

    # loop through first level of subexpressions: 
    while exp != '':

        if r.match(m_gro, exp):
            _ = r.match(m_gro, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)

        elif r.match(m_set, exp):
            _ = r.match(m_set, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)



        elif r.match(m_alt, exp):
            _ = ''

        elif r.match(m_lit, exp):
            _ = r.match(m_lit, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)

        else:
            print('ERROR getting states')
            break

    n_states = len(exp_list)


    # >>>PRINTOUT<<<
    if print_type == 3:
        print('GET STATES OUT\n  states:\n  {}\n  # of states: {}'.format(exp_list, n_states))


    return exp_list


def finite_state(exp_list, seq, level = 0, pattern_builder = '', iter_count = 0, pat_match = [], seq_match = []):


    # >>>PRINTOUT<<<
    if (print_type == 3):
        print('\nSTARTING MACHINE\n  EXP: {}\n  SEQ: {}\n  LEVEL: {}\n  matched: {}\n  pat_match: {}'.format(exp_list, seq, level, pattern_builder, pat_match))


    # patterns
    m_gro = '\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = '\^*\[.+?\]({.+?})*\$*'
    m_alt = '\|'
    m_squ = '\{(.),(.)\}'
    m_lit = '\^*[.\w]({.+?})*\$*|\$'


    # set state, n_state
    state = 0
    n_states = len(exp_list)
    #save_state = []
    #save_expression = []


    # temp exp
    local_seq = seq

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\n  >>>MACHINE START')



    # set failure cap so no endless loop
    failure_cap = 1000

    # since len(exp_list) returns + 1 over iteration (0 index) use the last 'state' as success state
    while state != n_states:

        for exp in exp_list:

            # iterations
            iter_count+=1

            # >>>PRINTOUT<<<
            if print_type == 3:
                print('  iteration count: {}'.format(iter_count))

            # >>>PRINTOUT<<<
            if print_type == 3:
                print('\n  evaluating: {}\n  for string: {}'.format(exp, local_seq))


            # alternation reset
            if len(alt_states(exp)) > 0:

                # get operand options
                operands = alt_states(exp)               
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1                   

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  ALT MATCH: {}\n  state: {}\n  opers returned: {}\n  level in: {}'.format(exp, state, operands, level))

                # compile local altneration
                for oper in operands:
                    # get substates
                    _ = get_states(oper)
                    # compile list
                    oper_list = _ + temp_list
                    # send to finite_state, sublevel                    
                    alt_status, pats = finite_state(oper_list, local_seq, level = level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                    if alt_status == 'success':
                        return alt_status, pats


            # group cycle
            elif r.match(m_gro, exp) is not None:
                # get operand options
                operands = group_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  GROUP MATCH: {}\n  state: {}\n  opers returned: {}\n  level in: {}'.format(exp, state, operands, level))

                # compile local
                oper_list = operands + temp_list
                # send to finite_state, sublevel
                group_status, pats = finite_state(oper_list, local_seq, level=level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                if group_status == 'success':
                    return group_status, pats


            # quantifier reset
            elif r.search(m_squ, exp) is not None:
                # get operand options
                operands = quant_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  QUANT MATCH: {}\n  state: {}\n  opers returned: {}\n  level in: {}'.format(exp, state, operands, level))

                # compile local
                for oper in reversed(operands):
                    # compile list
                    oper_list = [oper] + temp_list
                    # send to finite_state, sublevel
                    quant_status, pats = finite_state(oper_list, local_seq, level=level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                    if quant_status == 'success':
                        return quant_status, pats


            # record literal
            elif r.match(exp, local_seq) is not None:
                # add to local pattern
                m = r.match(exp, local_seq).group(0)
                local_seq = slice_string(m, local_seq)

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  state transition: {}\n  state {} ==> {} of {}'.format(exp, state, state+1, n_states))

                # iterate state for match
                pattern_builder = pattern_builder + exp
                pat_match = pat_match + [(exp, m)]
                state += 1
            elif r.match(exp, local_seq) is None:
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  Return FAIL on {}, level: {}, state: {}'.format(exp, level, state))
                status = 'fail'
                return status, pattern_builder


            # machine success
            if state == n_states:

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  MACHINE SUCCESS\n  level: {}\n  state: {}\n  exp: {}'.format(level, state, pattern_builder))

                status = 'success'
                return status, pat_match

            # timeout
            if iter_count == failure_cap:
                state = n_states

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('===============\nFAILURE CAP MET\n  level: {}\n  exp state: {}\n==============='.format(level, state))
                break

def group_states(exp):


    # patterns
    m_gro = '\^*\((?:[^()]+|(?R))*+\)({.+?})*\$*'
    m_set = '\^*\[.+?\]({.+?})*\$*'
    m_alt = '\|'
    m_squ = '\{(.),(.)\}'
    m_lit = '\^*[.\w]({.+?})*\$*'

    ret_list = []

    # iterate over groups
    groups = r.finditer(m_gro, exp)

    for gr in groups:
        _ = strip_nest(gr.group())      

        # alternation reset
        if r.search(m_alt, _):
            ret_list.append(_)

        else:
            _ = get_states(_)
            for thing in _:
                ret_list.append(thing)

    return(ret_list)

def quant_states(exp):


    # >>>PRINTOUT<<<
    if print_type == 4:
        print('\nGET QUANT STATES\n  EXP: {}'.format(exp))

    squ_opr = '(.+)\{.,.\}'
    m_squ = '\{(.),(.)\}'

    # create states
    states_list = []    

    # get operand
    operand_obj = r.finditer(squ_opr, exp)
    for match in operand_obj:
        operand = match.group(1)

    # get repetitions
    fa = r.findall(m_squ, exp)
    for m, n in fa:
        # loop through range
        for x in range(int(m), (int(n)+1)):
            # construct string
            _ = operand + '{' + str(x) + '}'
            # append to list
            states_list.append(_)

    # >>>PRINTOUT<<<
    if print_type == 4:
        print('  QUANT OUT: {}\n'.format(states_list))

    return states_list

%%time

print_type = 1
"""0:    
   1: includes input
   2: 
   3: all output prints on """


dataframe_counting = 0
for x in range(len(df)):

    try:
        df_expression = df.iloc[x, 2]
        df_subsequence = df.iloc[x, 1]

        # call function
        identify_submatches(df_expression, df_subsequence)
        print(dataframe_counting)
        dataframe_counting += 1
    except:
        pass

Output Return Examples

Output values (i.e. subexpression & index set) are tab delimited.

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TRQARRNRRRRWRERQRQIH
Subequence Match: RRRRWR

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2270

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TASQRRNRRRRWKRRGLQIL
Subequence Match: RRRRWK

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2271

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TRKARRNRRRRWRARQKQIS
Subequence Match: RRRRWR

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2272

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: LDFPSKKRKRSRWNQDTMEQ
Subequence Match: KKRKRSRWN

    [KR]{4} (5, 9)
    [KR]    (9, 10)
    .   (10, 11)
    [KR]    (11, 12)
    W   (12, 13)
    .   (13, 14)
2273

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: ASQPPSKRKRRWDQTADQTP
Subequence Match: KRKRRWD

    [KR]{2} (6, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2274

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: GGATSSARKNRWDETPKTER
Subequence Match: RKNRWD

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2275

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: PTPGASKRKSRWDETPASQM
Subequence Match: KRKSRWD

    [KR]{2} (6, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2276

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: LLNAATALSGSMQYLLNYVN
Subequence Match: LLNAATALSGSMQYLLNYV

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2277

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IFEASKKVTNSLSNLISLIG
Subequence Match: IFEASKKVTNSLSNLISLI

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2278

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IYEKAKEVSSALSKVLSKID
Subequence Match: IYEKAKEVSSALSKVLSKI

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2279

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IYKAAKDVTTSLSKVLKNIN
Subequence Match: IYKAAKDVTTSLSKVLKNI

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2280

Data from: ELM (The Eukaryotic Linear Motif resource for Functional Sites in Proteins) 2020. Retrieved from http://elm.eu.org/searchdb.html