Search code examples
pythonpandasdataframeoptimizationstartswith

check if string values of a column in a dataframe start with string elements of a tuple (other than str.startswith)


I have a pandas dataframe column with random values ("457645","458762496","1113423453"...) and I need to check if these values start with the elements of a tuple ("323","229","111").

In this case, it should be true for "1113423453".

I have tried df[column].str.startswith(tuple), which works fine; but for a large amount of data (2M df rows and 3K tuple elements) it becomes much slower (around 28 seconds) compared to 10K df rows and 3K tuple elements (1.47 seconds).

Is there a more efficient way to do it?


Solution

  • I have tried df[column].str.startswith(tuple), which works fine … but i'm searching for a more efficient way to do it if it's possible

    Since startswith() isn't optimized for a large amount of prefix strings and does just a linear search through them, using a binary search may be more efficient here. For this, we need the prefixes sorted.

    from bisect import bisect_right
    s = sorted(tuple)
    df[column].apply(lambda str: str.startswith(s[bisect_right(s, str)-1]))
    

    is it possible to extract the prefix into a new column of the dataframe?

    Yes, e. g. with this function:

    def startwiths(str):
        prefix = s[bisect_right(s, str)-1]
        if str.startswith(prefix): return prefix
    
    df['new column'] = df[column].apply(startwiths)