
How to exclude substring with regex from a large list of strings more efficiently


I have a large list of strings (len > 1000000) that I'm trying to convert into a dataframe. Before that, I have to exclude items that are numbers 20 to 22 digits long (e.g., 0000000011111216546546), since they cause problems in my dataframe.

I got this done with the following code, which I know is not Pythonic (and takes 20 minutes to run):

import re

exclusao2 = []  # items to exclude
for i in range(len(lines2)):
    if re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', lines2[i]):
        exclusao2.append(lines2[i])

lines2 = [x for x in lines2 if x not in exclusao2]

Does anyone know how to do this faster?


Solution

  • Your code is slow because you build a list and then use it only to test membership. Testing membership in a list is a linear scan, so doing it repeatedly is inefficient.

    Assuming 1% of the values must be removed, exclusao2 holds about 10k strings. The list comprehension then checks each of your 1,000,000+ strings against that list, which can take up to 10k comparisons per string. This is slow.

    To improve your code, use a set, where a membership test takes constant time on average.

    exclusao2 = set()
    for s in lines2:
        if re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', s):
            exclusao2.add(s)
    
    lines2 = [x for x in lines2 if x not in exclusao2]
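
To illustrate the point about membership tests, here is a minimal sketch on made-up sample data (the names `lines`, `excluded_list`, and `excluded_set` are hypothetical). Both filters keep the same items; only the cost of each `in` lookup differs.

```python
import re

# Hypothetical sample data: two long digit strings among ordinary lines.
lines = ["abc", "00000000111112165465", "x1", "1234567890123456789012", "42"]

pattern = r'(?:(?<!\d)\d{20,30}(?!\d))'

# Same test as in the question, collected once as a list and once as a set.
excluded_list = [s for s in lines if re.match(pattern, s)]
excluded_set = set(excluded_list)

# `x not in excluded_list` scans the list element by element;
# `x not in excluded_set` is a hash lookup (constant time on average).
kept_via_list = [x for x in lines if x not in excluded_list]
kept_via_set = [x for x in lines if x not in excluded_set]
```

With only a handful of excluded items the difference is negligible; it grows with the size of the exclusion collection.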
    

    But that's still not ideal.

    You can use your test directly:

    lines2 = [s for s in lines2 if not re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', s)]
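
A possible refinement, sketched here on hypothetical sample data: compile the pattern once, and drop the lookbehind, which is redundant because `re.match` only attempts a match at the start of the string. The name `long_number` is an assumption for illustration. Note that the pattern also removes strings that merely start with a long number.

```python
import re

# Hypothetical sample data, including a line that only starts with a long number.
lines2 = [
    "hello",
    "00000000111112165465",           # 20 digits: removed
    "00000000111112165465 trailing",  # starts with 20 digits: also removed
    "1234567890123456789",            # 19 digits: kept
]

# Compile once; re.match only tries position 0, so the lookbehind is not needed.
long_number = re.compile(r'\d{20,30}(?!\d)')

filtered = [s for s in lines2 if not long_number.match(s)]

# Reference result using the pattern exactly as written in the question.
reference = [s for s in lines2
             if not re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', s)]
```

The `re` module caches recently used patterns, so the gain from precompiling is likely small; the main benefit is readability.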
    

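    Even better, since your check is quite simple, avoid regexes and use string operations, as suggested in the comments. Note that this version assumes each unwanted item consists entirely of digits, and it uses the 20 to 22 digit range from your description rather than the {20,30} in your pattern: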

    lines2 = [s for s in lines2
              if not (s.isdigit() and (20 <= len(s) <= 22))]
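
The string-based check can be wrapped in a small helper to make the bounds explicit. This is a sketch on made-up data; `is_long_number`, `min_len`, and `max_len` are hypothetical names, not part of the original code. It also shows which edge cases this check keeps.

```python
def is_long_number(s, min_len=20, max_len=22):
    """Return True if s is all digits and min_len..max_len characters long."""
    # The cheap length comparison runs first and short-circuits most lines.
    return min_len <= len(s) <= max_len and s.isdigit()


# Hypothetical sample data.
lines2 = [
    "0000000011111216546546",    # 22 digits: removed
    "00000000111112165465",      # 20 digits: removed
    "12345678901234567890123",   # 23 digits: kept with these bounds
    "00000000111112165465 abc",  # digits followed by text: kept
    "name,value",                # ordinary line: kept
]

lines2 = [s for s in lines2 if not is_long_number(s)]
```

If lines that only begin with a long number must also be dropped, the regex version is the closer match to the original behavior.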