I have a large list of strings (len > 1,000,000) that I'm trying to convert into a dataframe. Before that, I have to exclude items that are numbers 20 to 22 digits long (e.g. 0000000011111216546546), since they are a problem for my dataframe.
I got this done with the following code, which I know is not Pythonic (and takes 20 minutes to run):
import re

exclusao2 = []
for i in range(len(lines2)):
    if re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', lines2[i]):
        exclusao2.append(lines2[i])
lines2 = [x for x in lines2 if x not in exclusao2]
Does anyone here know how to do this faster?
Your code is slow because you build a list and then use it only for membership tests. Testing membership in a list is O(n), so repeating that test for every element makes the whole filter quadratic.
Assuming about 1% of the values need to be removed, that gives roughly 10,000 strings in exclusao2. Then, for each of the million strings in your list comprehension, you scan that entire list. This is slow.
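To see the gap concretely, here is a small micro-benchmark sketch (the sample data and sizes are mine, not from the question):

import timeit

# Illustrative only: list membership scans elements one by one,
# while set membership is a hash lookup, O(1) on average.
data = [str(i) for i in range(10_000)]
as_list = data
as_set = set(data)

print(timeit.timeit(lambda: '9999' in as_list, number=1_000))  # slow: linear scan
print(timeit.timeit(lambda: '9999' in as_set, number=1_000))   # fast: hash lookup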
To improve your code, use a set, whose membership tests are constant time on average:
exclusao2 = set()
for s in lines2:
    if re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', s):
        exclusao2.add(s)
lines2 = [x for x in lines2 if x not in exclusao2]
But that's still not ideal: the intermediate set and the second pass are unnecessary. You can apply the test directly in a single list comprehension:
lines2 = [s for s in lines2 if not re.match(r'(?:(?<!\d)\d{20,30}(?!\d))', s)]
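If you do keep the regex, compiling it once is a small extra win. Python caches compiled patterns internally, so the gain is modest, but it skips the cache lookup on every call (this tweak is my suggestion, not part of the original answer):

import re

# Compile once, reuse for every string.
pattern = re.compile(r'(?:(?<!\d)\d{20,30}(?!\d))')
lines2 = [s for s in lines2 if not pattern.match(s)]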
Even better: since your check is quite simple, avoid regexes altogether and use string operations, as suggested in the comments:
lines2 = [s for s in lines2
          if not (s.isdigit() and 20 <= len(s) <= 22)]
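A quick sanity check with made-up sample data (not from the question):

# "1" * 22 and the 22-digit example from the question are dropped;
# short numbers and ordinary strings are kept.
lines2 = ["hello", "0000000011111216546546", "12345", "1" * 22]
lines2 = [s for s in lines2 if not (s.isdigit() and 20 <= len(s) <= 22)]
print(lines2)  # ['hello', '12345']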