I have a list with strings as below.
candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
And I want the list to be filtered as ["HelloWorld", "Foo", "Bar"]
, because others are substrings. I can do it like this, but don't think it's fast or elegant.
def filter_not_substring(candidates):
survive = []
for a in candidates:
for b in candidates:
if a == b:
continue
if a in b:
break
else:
survive.append(a)
return survive
Is there any fast way to do it?
How about:
candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
print(result)
Counter to what was suggested in the comments:
from timeit import timeit
def filter_not_substring(candidates):
survive = []
for a in candidates:
for b in candidates:
if a == b:
continue
if a in b:
break
else:
survive.append(a)
return survive
def filter_not_substring2a(candidates):
return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]
def filter_not_substring2b(candidates):
return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
print(timeit(lambda: filter_not_substring(xs)))
print(timeit(lambda: filter_not_substring2a(xs)))
print(timeit(lambda: filter_not_substring2b(xs)))
Result:
['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
1.5163685
4.6516653
3.8334089999999996
So, OP's solution is substantially faster, but filter_not_substring2b
is still about 20% faster than 2a
. So, putting the len
comparison first doesn't save time.
For any production scenario, OP's function is probably optimal - a way to speed it up might be to bring the whole problem into C, but I doubt that would show great gains, since the logic is pretty straightforward already and I'd expect Python to do a fairly good job of it as well.
User @ming noted that OP's solution can be improved a bit:
def filter_not_substring_b(candidates):
survive = []
for a in candidates:
for b in candidates:
if a in b and a != b:
break
else:
survive.append(a)
return survive
This version of the function is somewhat faster, for me about 10-15%
Finally, note that this is only just faster than 2b
, even though it is very similar to the optimised solution by @ming, but almost 3x slower than their solution. It's unclear to me why that would be - if anyone has fairly certain thoughts on that, please share in the comments:
def filter_not_substring_c(candidates):
return [a for a in candidates if all(a not in b or a == b for b in candidates)]