Search code examples
pythonsql-serverpostgresqltext-processing

Python re.findall regex and text processing


I'm looking to find and modify some sql syntax around the convert function. I want basically any convert(A,B) or CONVERT(A,B) in all my files to be selected and converted to B::A.

So far I tried selecting them with re.findall(r"\bconvert\b\(.*?,.*\)", l, re.IGNORECASE) But it's only returning a small selection out of what I want and I also have trouble actually manipulating the A/B I mentioned.

For example, a sample line (note the nested structure here is irrelevant, I'm only getting the outer layer working if possible)

convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || convert(nvarchar, o.Option_Number) || '/') as LocPath

...should become...

'/' || es.Item_ID::nvarchar || ':' || o.Option_Number::nvarchar || '/' :: varchar as LocPath

Example2:

SELECT LocationID AS ItemId, convert(bigint, -1),

...should become...

SELECT LocationID AS ItemId, -1::bigint,

I think this should be possible with some kind of re.sub with groups and currently have a code structure inside a for each loop where line is the each line in the file:

matchConvert = ["convert(", "CONVERT("]
a = next((a for a in matchConvert if a in line), False)
if a:
    print("convert() line")
    #line = re.sub(re.escape(a) + r'', '', line)

Edit: In the end I went with a non re solution and handled each line by identifying each block and manipulate them accordingly.


Solution

  • Here's my solution based on @Иван-Балван's code. Breaking this structure into blocks makes further specification a lot easier than I previously thought and I'll be using this method for a lot of other operations as well.

    # Check for balanced brackets
    def checkBracket(my_string):
        count = 0
        for c in my_string:
            if c == "(":
                count+=1
            elif c == ")":
                count-=1
        return count
    
    
    # Modify the first convert in line
    # Based on suggestions from stackoverflow.com/questions/73040953
    def modifyConvert(l):
        # find the location of convert()
        count = l.index('convert(')
    
        # select the group before convert() call
        before = l[:count]
    
        group=""
        n1=0
        n2=0
        A=""
        B=""
        operate = False
        operators = ["|", "<", ">", "="]
        # look for A group before comma
        for n1, i in enumerate(l[count+8:], start=len(before)+8):
            # find current position in l
            checkIndex = checkBracket(l[count+8:][:n1-len(before)-8])
            if i == ',' and checkIndex == 0:
                A = group
                break
            group += i
    
        # look for B group after comma
        group = ""
        for n2, i in enumerate(l[n1+1:], start=n1+1):
            checkIndex = checkBracket(l[count+n1-len(before):][:n2-n1+1])
            if i == ',' and checkIndex == 0:
                return l
            elif checkIndex < 0:
                B = group
                break
            group += i
            
            # mark operators
            if i in operators:
                operate = True
    
        # select the group after convert() call
        after = l[n2+1:]
    
        # (B) if it contains operators
        if operate:
            return before + "(" + B.lstrip() + ') :: ' + A + after
        else:
            return before + B.lstrip() + '::' + A + after
    
    
    # Modify cast syntax with convert(a,b). return line.
    def convertCast(l):
    
        # Call helper for nested cases
        i = l.count('convert(')
        while i>0:
            i -= 1
            l = modifyConvert(l)
    
        return l