Tags: python, regex, negative-lookbehind

How to use a negative lookbehind


Basically, I am changing every hexadecimal color value with a blue hue to its red-hue counterpart in a given stylesheet (e.g. #00f is changed to #ff0000; my function outputs six-character hexadecimal values, excluding the #).

It was not a problem creating a regular expression to match hexadecimal colors (I'm not concerned about HTML color names, although I may eventually care about rgb, rgba, hsb, etc. values). This is what I ended up with: #(([0-9A-z]{3}){1,2}). It works, but I want it to be foolproof. For example, if somebody happens to set a background image whose URL contains a fragment (e.g. #top) that is a valid hexadecimal value, I don't want to change it. I tried a negative lookbehind, but it doesn't seem to work. I was using \B#(([0-9A-z]{3}){1,2}), but if a non-word character (such as a space) precedes the '#', it still matches the URL fragment. This is what I thought should do the trick, but doesn't: (?<!url\([^#)]*)#(([0-9A-z]{3}){1,2}).

I am using the desktop version of RegExr to test with the following stylesheet:

body {
    background: #f09 url('images#06F');
}
span {
    background=#00f url('images#889');
}
div {
    background:#E4aaa0 url('images#889');
}
h1 {
    background: #fff #dddddd;
}

Whenever I hover over the (?<! substring, RegExr identifies it as a "Negative lookahead matching 'url\([^#)]*'." Could there be a bug, or am I just having a bad regex day? And while we're at it, are there any other contexts in which a '#' is used for non-hexadecimal purposes?

EDIT: Alright, I can't program early in the morning. That hexadecimal regex should be #(([0-9A-Fa-f]{3}){1,2})
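
As a quick check of why the first character class was too loose, here is a minimal sketch. The two patterns are the ones quoted above; the sample strings and the names loose and strict are made up for illustration. In ASCII, the range A-z also covers the letters g to z and the characters [ \ ] ^ _ ` that sit between Z and a.

```python
import re

# Original pattern from the question: A-z spans more than the hex digits
loose = re.compile(r'#(([0-9A-z]{3}){1,2})')
# Corrected pattern from the EDIT: hex digits only
strict = re.compile(r'#(([0-9A-Fa-f]{3}){1,2})')

# Illustrative samples (assumptions, not taken from the stylesheet)
for text in ('#top', '#f_9', '#00f'):
    print(text,
          'loose:', bool(loose.fullmatch(text)),
          'strict:', bool(strict.fullmatch(text)))
```

With the corrected class, a fragment such as #top is no longer mistaken for a color, although fragments made only of hex digits (such as #889) still are.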

EDIT 2: Alright, so I missed the detail that most languages require fixed-length lookbehinds.
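
Python's re module is one of the engines with this restriction, so the lookbehind attempted above is rejected when the pattern is compiled. The sketch below only demonstrates that behaviour. The helper name compiles and the fixed-width alternative (?<!images) are assumptions for illustration; the alternative only works for the sample stylesheet, where every fragment directly follows the literal text images.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The lookbehind from the question: [^#)]* has no fixed width
variable = r'(?<!url\([^#)]*)#(([0-9A-Fa-f]{3}){1,2})'
# Hypothetical fixed-width variant, valid only for this sample
fixed = r'(?<!images)#(([0-9A-Fa-f]{3}){1,2})'

print('variable-width accepted:', compiles(variable))
print('fixed-width accepted:', compiles(fixed))

line = "background: #f09 url('images#06F');"
print([m.group(1) for m in re.finditer(fixed, line)])
```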


Solution

  • I think what you need is one of the following two solutions:

    import re

    ss = '''    background: #f09 url('images#06F'); 
        background=#00f url('images #889'); 
        background:#E4aaa0 url('images#890'); 
        background: #fff #dddddd; '''

    print(ss)

    # three hexadecimal digits; repeated once or twice for #rgb / #rrggbb
    three = '(?:[0-9A-Fa-f]{3})'

    # first solution: only the first color after "background" on each line
    regx = re.compile(r'^ *background[ =:]*#(%s{1,2})' % three, re.MULTILINE)
    print(regx.findall(ss))

    print('-----------------------------------------------------')

    # second solution: also a color that follows another color and spaces
    regx = re.compile(r'(?:(?:^ *background[ =:]*)|(?:(?<=#%s)|(?<=#%s%s)) +)'
                      r'#(%s{1,2})' % (three, three, three, three),
                      re.MULTILINE)
    print(regx.findall(ss))
    

    result

        background: #f09 url('images#06F'); 
        background=#00f url('images #889'); 
        background:#E4aaa0 url('images#890'); 
        background: #fff #dddddd; 
    ['f09', '00f', 'E4aaa0', 'fff']
    -----------------------------------------------------
    ['f09', '00f', 'E4aaa0', 'fff', 'dddddd']
    

    Edit 1

    import re

    ss = '''    background: #f09 url('images#06F'); 
        background=#00f url('images #889'); 
        color:#E4aaa0 url('images#890'); 
        background: #fff #dddddd#125e88    #ae3;
        Walter (Elias) Disney: #f51f51 '''

    print(ss + '\n')

    three = '(?:[0-9A-Fa-f]{3})'

    # any property name, not just "background", may precede the first color
    regx = re.compile(r'^ *[^=:]+[ =:]*#(%s{1,2})' % three, re.MULTILINE)
    print(regx.findall(ss))

    print('-----------------------------------------------------')

    # following colors may be adjacent: " *" instead of " +"
    regx = re.compile(r'(?:(?:^ *[^=:]+[ =:]*)|(?:(?<=#%s)|(?<=#%s%s)) *)'
                      r'#(%s{1,2})' % (three, three, three, three),
                      re.MULTILINE)
    print(regx.findall(ss))
    

    result

        background: #f09 url('images#06F'); 
        background=#00f url('images #889'); 
        color:#E4aaa0 url('images#890'); 
        background: #fff #dddddd#125e88    #ae3;
        Walter (Elias) Disney: #f51f51 
    
    ['f09', '00f', 'E4aaa0', 'fff', 'f51f51']
    -----------------------------------------------------
    ['f09', '00f', 'E4aaa0', 'fff', 'dddddd', '125e88', 'ae3', 'f51f51']
    

    Edit 2

    import re

    ss = '''    background: #f09 url('images#06F'); 
        background=#00f url('images #889'); 
        color:#E4aaa0 url('images#890'); 
        background: #fff #dddddd#125e88    #ae3;
        Walter (Elias) Disney: #f51f51
        background: -webkit-gradient(linear, from(#000000), to(#ffffff));. '''

    print(ss + '\n')

    three = '(?:[0-9A-Fa-f]{3})'

    # what may precede a color: the start of a line up to its first '#',
    # a previous color (3 or 6 digits), or the text " to("
    preceding = ('(?:(?:^[^#]*)'
                     '|'
                     '(?:(?<=#%s)'
                         '|'
                         '(?<=#%s%s)'
                         '|'
                         r'(?<= to\()'
                         ')'
                     ')') % (three, three, three)

    regx = re.compile('%s *#(%s{1,2})' % (preceding, three), re.MULTILINE)
    print(regx.findall(ss))
    

    result

        background: #f09 url('images#06F'); 
        background=#00f url('images #889'); 
        color:#E4aaa0 url('images#890'); 
        background: #fff #dddddd#125e88    #ae3;
        Walter (Elias) Disney: #f51f51
        background: -webkit-gradient(linear, from(#000000), to(#ffffff));. 
    
    ['f09', '00f', 'E4aaa0', 'fff', 'dddddd', '125e88', 'ae3', 'f51f51', '000000', 'ffffff']
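
    Since the goal in the question is to change the colors rather than just list them, here is a minimal sketch of how the regex from Edit 2 might be used with re.sub. The names expand and replace_colours are hypothetical, and expand is only a stand-in for the asker's own blue-to-red conversion function, which is not shown. Because the pattern consumes the text in front of each color, the callback puts that text back unchanged.

```python
import re

three = '(?:[0-9A-Fa-f]{3})'
preceding = (r'(?:(?:^[^#]*)'
             r'|'
             r'(?:(?<=#%s)|(?<=#%s%s)|(?<= to\()))') % (three, three, three)
regx = re.compile(r'%s *#(%s{1,2})' % (preceding, three), re.MULTILINE)


def expand(hexa):
    # Hypothetical stand-in for the real conversion function:
    # expands 'f09' to 'ff0099' and leaves six-digit values alone.
    return ''.join(c * 2 for c in hexa) if len(hexa) == 3 else hexa


def replace_colours(text, convert=expand):
    def repl(m):
        # Length of everything matched before the hex digits, '#' included
        prefix_len = m.start(1) - m.start()
        return m.group(0)[:prefix_len] + convert(m.group(1))
    return regx.sub(repl, text)


print(replace_colours("background: #f09 url('images#06F');"))
```

    The lookbehinds are evaluated against the original string, so replacing a three-digit color with a six-digit one does not affect how the following colors are found.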
    

    Regexes are extremely powerful, provided that the text contains enough portions with a stable organisation around the variable portions you want to catch. If the structure of the analyzed text becomes too loose, it becomes impossible to write a regex.

    Are there still many other "Harlequin-like patchwork" structures possible for your strings?