Tags: python, regex, parsing, negative-lookbehind

Regex matching with negative look-behind assertion


I am parsing C source files. I want to match all the variables (in upper snake case) that end in _VALUE and do not begin with CANA_, CANB_, ..., CANF_. I need to match the whole variable name for later substitution.

This is my current setup in Python:

import re

def signal_ending_VALUE_updater(match: re.Match) -> str:
    groups = match.groupdict()
    return some_operation_on(groups["SIGNAL_NAME"])

REGEX=r"(?<!CAN[A-F]_)\b(?P<SIGNAL_NAME>\w+_VALUE)\b"

with open(file_path, 'r') as f:
    content = f.read()
content_new = re.sub(REGEX, signal_ending_VALUE_updater, content)

Unfortunately, this regex does not always work. For example, take this test case:

test="        shared->option.mem = ((canAGetScuHmiVehReqLiftModBtnSt() == CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE) ||"
re.findall(REGEX, test)

This returns the variable (CANA_SCU_HMI...) that I do not want to match. What am I not considering in the regex?

The idea behind the regex is:

  • (?<!CAN[A-F]_): a negative lookbehind, meant to ensure the match does not start with CAN followed by one of the letters A to F and an underscore (_).
  • \b: word boundary, ensuring that we are matching whole words and not one part of a word
  • (?P<SIGNAL_NAME>\w+_VALUE):
    • (?P<SIGNAL_NAME>...): group match with the name SIGNAL_NAME
    • \w: matches a word character ([a-zA-Z0-9_] for ASCII text), which covers snake-case variable names
    • +: one or more of the preceding token
    • _VALUE matches the literal string _VALUE at the end of the variable name.
  • \b This again is a word boundary that ensures the match ends right after the variable name.

Solution

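  • In your regex, (?<!CAN[A-F]_)\b tests a single position: it asserts that CAN[A-F]_ does not occur directly to the left of that position, and that the position is a word boundary. It says nothing about the text that follows.

    You get a match for CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE because at the beginning of that name the characters directly to the left are "== ", not CAN[A-F]_, so the assertion is true. In fact, the lookbehind can never reject anything here: a position preceded by _ and followed by a word character is not a word boundary, so \b already fails wherever CAN[A-F]_ sits directly to the left.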
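The behaviour described above can be checked with a small sketch. The input line is a shortened, made-up variant of the test case from the question, and the name left_context is only illustrative; the slice shows the five characters that the lookbehind actually inspects.

```python
import re

# Original pattern from the question: lookbehind placed before the word boundary.
LOOKBEHIND = re.compile(r"(?<!CAN[A-F]_)\b(?P<SIGNAL_NAME>\w+_VALUE)\b")

# Hypothetical, shortened version of the line from the question.
line = "mem = (get() == CANA_SCU_BTN_ST_PRESSED_VALUE) ||"

m = LOOKBEHIND.search(line)
start = m.start("SIGNAL_NAME")

# CAN[A-F]_ is five characters long, so the lookbehind examines the
# five characters immediately to the left of the match start.
left_context = line[max(0, start - 5):start]

print(repr(left_context))       # text the lookbehind was tested against
print(m.group("SIGNAL_NAME"))   # the name that was matched anyway
```

The text to the left of the match is punctuation and spaces, so the negative lookbehind is satisfied and the unwanted name is returned.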

    What you can do instead is start with a word boundary, and then use a negative lookahead to assert that what is directly to the right does not match the pattern CAN[A-F]_:

    \b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b
    

    See a regex 101 demo
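As a rough usage sketch, the lookahead version can be plugged into re.sub in the same way as in the question. The sample source text and the rename_signal callback are assumptions made for illustration; rename_signal simply lowercases the name and stands in for whatever some_operation_on does.

```python
import re

# Word boundary first, then a negative lookahead on the text to the right.
FIXED = re.compile(r"\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b")

def rename_signal(match: re.Match) -> str:
    # Hypothetical replacement logic: lowercase the matched name.
    return match.group("SIGNAL_NAME").lower()

# Made-up C-like input covering the excluded and included cases.
source = (
    "a = CANA_SCU_BTN_ST_PRESSED_VALUE;\n"
    "b = ENGINE_SPEED_MAX_VALUE;\n"
    "c = CANG_OTHER_VALUE;\n"
)

updated = FIXED.sub(rename_signal, source)
print(updated)
```

Names starting with CANA_ to CANF_ are left untouched, while other names ending in _VALUE (including CANG_..., since G is outside A-F) are passed to the callback.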