python regex regex-negation regex-greedy

Regex for matching a string plus the first occurrence of a substring


I am looking for a regex that solves the problem below. In the examples, I want to extract the values for VITMINC.

Test strings

1. As VITMINC##1-0-1##1 days.As
2. amlodipine ##vitamin c##diclo##tabramycin eye drop##metformin##0-0-1##15 days. amlodipine ##1-0-1## 3 days.
3. Xylometazoline(P) Nasal drops##0-1-0##2 days.   Paracetamol 500mg tab##0.5-0-0-0##2 days. VITMINC##0.5-0-0##2 days.   Chlorpheniramine maleate 4mg tab##0-1-0##2 days.
4. VITMINC##0-0-0-1##2 days.
5. amlodipine##vitamin c##diclo##tabramycin eye drop##metformin##0-0-1##15 days.

Sample Output for the above strings

1. VITMINC##1-0-1##1 days
2. ##vitamin c##diclo##tabramycin eye drop##metformin##0-0-1##15 days
3. VITMINC##0.5-0-0##2 days
4. VITMINC##0-0-0-1##2 days
5. vitamin c##diclo##tabramycin eye drop##metformin##0-0-1##15 days

I have tried the regexes below, but none of them gives the expected output:

VITMINC##.*##([0-9]+ [days]){1}?
VITMINC##.*##([0-9]+ [days])*?
VITMINC##.*##[0-9]+ days
VITMINC##.*##([0-9]+ days){1}?

Sorry if my explanation is unclear, and thanks in advance.


Solution

  • Assuming you do not actually want # chars at the beginning of the matches (the expected outputs for examples 2 and 5 contradict each other on this point), you can use

    (?i)VITA?MIN\s*C##.*?##[0-9]+ days
    


    Details

    • (?i) - case insensitive modifier
    • VITA?MIN - VITAMIN or VITMIN
    • \s* - 0 or more whitespaces
    • C## - a C## substring
    • .*? - any zero or more chars other than line break chars, as few as possible (lazy, so the match stops at the first following ##<digits> days)
    • ## - a ## substring
    • [0-9]+ days - 1 or more digits, space, days substring.
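
    As a rough sketch of how this could be used from Python (assumed from the question's python tag), the snippet below runs the pattern with re.search over the five test strings from the question. The names PATTERN, tests and extract are made up for illustration. It also runs the question's greedy attempt on string 3 to show why .* overshoots where .*? does not.

```python
import re

# Pattern from the answer. The inline (?i) flag makes it case-insensitive,
# so it matches both "VITMINC" and "vitamin c".
PATTERN = re.compile(r"(?i)VITA?MIN\s*C##.*?##[0-9]+ days")

# The five test strings from the question (list numbering removed).
tests = [
    "As VITMINC##1-0-1##1 days.As",
    "amlodipine ##vitamin c##diclo##tabramycin eye drop##metformin##0-0-1##15 days. amlodipine ##1-0-1## 3 days.",
    "Xylometazoline(P) Nasal drops##0-1-0##2 days.   Paracetamol 500mg tab##0.5-0-0-0##2 days. VITMINC##0.5-0-0##2 days.   Chlorpheniramine maleate 4mg tab##0-1-0##2 days.",
    "VITMINC##0-0-0-1##2 days.",
    "amlodipine##vitamin c##diclo##tabramycin eye drop##metformin##0-0-1##15 days.",
]


def extract(text):
    """Return the first match of PATTERN in text, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


results = [extract(t) for t in tests]
for r in results:
    print(r)

# Contrast: the greedy attempt from the question. Here .* runs to the end of
# the line and backtracks only to the LAST "##<digits> days", so on string 3
# it also swallows the Chlorpheniramine entry.
greedy = re.search(r"VITMINC##.*##[0-9]+ days", tests[2]).group(0)
print(greedy)
```

    With the lazy quantifier, the match for string 3 is just VITMINC##0.5-0-0##2 days, while the greedy version runs on to the final 2 days in the line.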