Issue with re.sub in Python3 and not in Python2


I've got an old script in Python 2.7 that runs a re.sub process correctly. However, when I try to use it in Python 3, I get "TypeError: expected string or bytes-like object".

The relevant code is

substitution_array=[
    [r"^Map From GroupLayer","Add Map GroupLayer"],[r"^Map From","Add Map Auto Layer"]
    ,[r"^\s+Papersize\s+.*",""],[r"^Set Window.*",""],[r"^Open Window.*",""]]

for row in substitution_array:
        print(row[0])
        for x in newfile:
          line = re.sub(row[0],row[1],x)
          line2=filter(line.strip, line)
          newfile2.append(line2)
        print ("Finished: "+row[0])
        newfile=newfile2
        newfile2=[]

I get the following output

G:\GIS_Tables\Vector_Data\Administrative\Cadastre\Road_Reserves>python3 Create_MB_from_WOR.py
--- Table Name: Road_Reserves
^Map From GroupLayer
Finished: ^Map From GroupLayer
^Map From
Traceback (most recent call last):
  File "Create_MB_from_WOR.py", line 43, in <module>
    line = re.sub(row[0],row[1],x)
  File "C:\OSGeo4W64\apps\Python37\lib\re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

So it is failing on [r"^Map From","Add Map Auto Layer"], and when I delete this entry it fails on the next one as well.

I had a look at https://docs.python.org/3/library/re.html and think that I have escaped things correctly but what's wrong here?

Here's the same code running on the same data in Python 2.7 correctly.


Solution

  • You did not provide a reproducible example, but I reproduced the error with the following:

    import re
    
    newfile = ['a']  # wasn't defined, assuming a list of strings
    newfile2 = []        # wasn't defined, assuming a list
    
    substitution_array=[
        [r"^Map From GroupLayer","Add Map GroupLayer"],[r"^Map From","Add Map Auto Layer"]
        ,[r"^\s+Papersize\s+.*",""],[r"^Set Window.*",""],[r"^Open Window.*",""]]
    
    for row in substitution_array:
            print(row[0])
            for x in newfile:
              print(f'{x=}')
              line = re.sub(row[0],row[1],x)
              line2=filter(line.strip, line)
              print(f'{line2=}')
              newfile2.append(line2)
              print(f'{newfile2=}')
            print ("Finished: "+row[0])
            newfile=newfile2
            newfile2=[]
            print(f'{newfile=} {newfile2=}')
    

    Output (comments added):

    ^Map From GroupLayer
    x='a'     # x is a string
    line2=<filter object at 0x000001E3D5BAAE50> # filter() returns a lazy iterator in Python 3, not a str
    newfile2=[<filter object at 0x000001E3D5BAAE50>] # newfile2 gets this object
    Finished: ^Map From GroupLayer
    newfile=[<filter object at 0x000001E3D5BAAE50>] newfile2=[]
    ^Map From
    x=<filter object at 0x000001E3D5BAAE50>  # NEXT ITERATION, x is that filter object
    Traceback (most recent call last):
      File "C:\Users\metolone\test.py", line 14, in <module>
        line = re.sub(row[0],row[1],x)    # then re.sub complains about it
      File "D:\dev\Python39\lib\re.py", line 210, in sub
        return _compile(pattern, flags).sub(repl, string, count)
    TypeError: expected string or bytes-like object
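
The same failure can be shown in isolation, without the loop. This is only a minimal sketch: the input line "Map From Table1" is made up for illustration, and the exact wording of the error message may vary slightly between Python 3 versions.

```python
import re

line = "Map From Table1"            # hypothetical input line, not from the question's data
result = filter(line.strip, line)   # Python 3: a lazy filter object, not a str

print(type(result).__name__)

try:
    # Passing the filter object where re.sub expects a str or bytes
    re.sub(r"^Map From", "Add Map Auto Layer", result)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

In Python 2 the same filter call returned a str, which is why the second pass over newfile worked there.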
    

    What do you think line2 = filter(line.strip, line) does anyway? filter calls line.strip(ch) for each character ch in line and keeps only the characters for which the result is truthy, that is, a non-empty string. line.strip(ch) returns an empty string only when every character in the line is ch. So a line containing at least two different characters comes back unchanged, while a line made up of a single repeated character is blanked. It is also inefficient: line.strip is called once per character, so a line of length n triggers n strip calls over the whole line. Example from Python 2, where filter on a str returns a str:

    >>> line = '  \n  a '          # variation, no change
    >>> filter(line.strip,line)
    '  \n  a '                     
    >>> line = '            '      # all spaces, blanks the line
    >>> filter(line.strip,line)
    ''
    >>> line = '   \n     '        # different kinds of whitespace, no change
    >>> filter(line.strip,line)
    '   \n     '
    >>> line = '\n\n\n\n\n'        # all same newline, blanks line
    >>> filter(line.strip,line)
    ''
    >>> line = '\n\n \n\n'         # different kinds of whitespace, no change
    >>> filter(line.strip,line)
    '\n\n \n\n'
    >>> line = 'aaaaaaaaaaaaaaaa'  # no variation, blanks the line
    >>> filter(line.strip,line)
    ''
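
To check the same behaviour under Python 3, the filter object has to be joined back into a string. The helper name py2_style_filter below is my own, used only for illustration; it is not part of the question's script.

```python
def py2_style_filter(line):
    """Emulate Python 2's filter(line.strip, line) on a str by joining the kept characters."""
    return "".join(filter(line.strip, line))

samples = [
    "  \n  a ",            # mixed characters: comes back unchanged
    "            ",        # all spaces: blanked
    "\n\n \n\n",           # two kinds of whitespace: comes back unchanged
    "aaaaaaaaaaaaaaaa",    # one repeated character: blanked
]

for sample in samples:
    print(repr(sample), "->", repr(py2_style_filter(sample)))
```

This reproduces the Python 2 results, which confirms that the line never stripped whitespace or dropped blank lines in any useful way.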
    

    So this looks like a bug. If you state what this line is supposed to do, we can recommend a better way to do it.
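
If the intent was to apply the substitutions and then drop lines that end up blank, one possible Python 3 version is sketched below. That intent is a guess on my part. The names SUBSTITUTIONS and rewrite_lines and the sample lines are hypothetical; only the patterns and replacements come from the question.

```python
import re

# Patterns and replacements from the question, compiled once.
# Order matters: the GroupLayer pattern must run before the more general "^Map From".
SUBSTITUTIONS = [
    (re.compile(r"^Map From GroupLayer"), "Add Map GroupLayer"),
    (re.compile(r"^Map From"), "Add Map Auto Layer"),
    (re.compile(r"^\s+Papersize\s+.*"), ""),
    (re.compile(r"^Set Window.*"), ""),
    (re.compile(r"^Open Window.*"), ""),
]

def rewrite_lines(lines):
    """Apply every substitution to each line and drop lines left blank."""
    result = []
    for line in lines:
        for pattern, replacement in SUBSTITUTIONS:
            line = pattern.sub(replacement, line)
        if line.strip():  # assumed intent: skip empty or whitespace-only lines
            result.append(line)
    return result

# Hypothetical sample input, standing in for the contents of the question's newfile.
sample = [
    "Map From GroupLayer Roads\n",
    "Map From Road_Reserves\n",
    "  Papersize A4\n",
    "Set Window FrontWindow()\n",
    "Open Window Ruler\n",
]
print(rewrite_lines(sample))
```

Every item stays a plain str, so nothing other than text is ever handed to re.sub, and the code behaves the same in Python 2.7 and Python 3.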