Search code examples
pythonsubstring

Is there a Pythonic way of filtering substrings of strings in a list?


I have a list with strings as below.

candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]

And I want the list to be filtered as ["HelloWorld", "Foo", "Bar"], because others are substrings. I can do it like this, but don't think it's fast or elegant.

def filter_not_substring(candidates):
    survive = []
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            if a in b:
                break
        else:
            survive.append(a)
    return survive

Is there any fast way to do it?


Solution

  • How about:

    candidates = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar"]
    result = [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
    print(result)
    

    Counter to what was suggested in the comments:

    from timeit import timeit
    
    
    def filter_not_substring(candidates):
        survive = []
        for a in candidates:
            for b in candidates:
                if a == b:
                    continue
                if a in b:
                    break
            else:
                survive.append(a)
        return survive
    
    
    def filter_not_substring2a(candidates):
        return [c for c in candidates if not any(len(o) > len(c) and c in o for o in candidates)]
    
    
    def filter_not_substring2b(candidates):
        return [c for c in candidates if not any(c in o and len(o) > len(c) for o in candidates)]
    
    
    xs = ["Hello", "World", "HelloWorld", "Foo", "bar", "ar", "bar"]
    print(filter_not_substring(xs), filter_not_substring2a(xs), filter_not_substring2b(xs))
    print(timeit(lambda: filter_not_substring(xs)))
    print(timeit(lambda: filter_not_substring2a(xs)))
    print(timeit(lambda: filter_not_substring2b(xs)))
    

    Result:

    ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar'] ['HelloWorld', 'Foo', 'bar', 'bar']
    1.5163685
    4.6516653
    3.8334089999999996
    

    So, OP's solution is substantially faster, but filter_not_substring2b is still about 20% faster than 2a. So, putting the len comparison first doesn't save time.

    For any production scenario, OP's function is probably optimal - a way to speed it up might be to bring the whole problem into C, but I doubt that would show great gains, since the logic is pretty straightforward already and I'd expect Python to do a fairly good job of it as well.

    User @ming noted that OP's solution can be improved a bit:

    def filter_not_substring_b(candidates):
        survive = []
        for a in candidates:
            for b in candidates:
                if a in b and a != b:
                    break
            else:
                survive.append(a)
        return survive
    

    This version of the function is somewhat faster, for me about 10-15%

    Finally, note that this is only just faster than 2b, even though it is very similar to the optimised solution by @ming, but almost 3x slower than their solution. It's unclear to me why that would be - if anyone has fairly certain thoughts on that, please share in the comments:

    def filter_not_substring_c(candidates):
        return [a for a in candidates if all(a not in b or a == b for b in candidates)]