Tags: python, string, performance, arguments, partial

Alternatives to using functools.partial with string methods


Profiling my code shows that the split and strip methods of str objects are among the most frequently called functions.

It happens that I use constructs such as:

with open(filename, "r") as my_file:
    for line in my_file:
        fields = line.strip("\n").split("\t")

And some of the files to which this is applied have a lot of lines.

So I tried using the "avoid dots" advice in https://wiki.python.org/moin/PythonSpeed/PerformanceTips as follows:

from functools import partial
split = str.split
tabsplit = partial(split, "\t")
strip = str.strip
endlinestrip = partial(strip, "\n")
def get_fields(tab_sep_line):
    return tabsplit(endlinestrip(tab_sep_line))

with open(filename, "r") as my_file:
    for line in my_file:
        fields = get_fields(line)

However, this raised "ValueError: empty separator" on the return line of my get_fields function.

After investigating, I understand that the separator is the second positional argument of str.split, the first being the string object itself. functools.partial therefore bound "\t" as the string to be split, and the result of "\n".strip(tab_sep_line) ended up being used as the separator. Hence the error.
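To make the argument order concrete, here is a minimal sketch that reproduces the error. The sample line is made up for illustration and does not come from the real files.

```python
from functools import partial

# Hypothetical sample line, not taken from any real file.
line = "a\tb\tc\n"

tabsplit = partial(str.split, "\t")
endlinestrip = partial(str.strip, "\n")

# partial places its stored arguments first, so endlinestrip(line) is
# str.strip("\n", line), i.e. "\n".strip(line). That removes from "\n"
# every character that occurs in line, which leaves an empty string.
stripped = endlinestrip(line)

# Likewise, tabsplit(x) is "\t".split(x): x is used as the separator,
# and an empty separator is rejected.
try:
    tabsplit(stripped)
except ValueError as exc:
    error_message = str(exc)

print(repr(stripped), error_message)
```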

What would you suggest doing instead?


Edit: I tried to compare three ways to implement the get_fields function.

Approach 1: Using plain .strip and .split

def get_fields(tab_sep_line):
    return tab_sep_line.strip("\n").split("\t")

Approach 2: Using lambda

split = str.split
strip = str.strip
tabsplit = lambda s : split(s, "\t")
endlinestrip = lambda s : strip(s, "\n")
def get_fields(tab_sep_line):
    return tabsplit(endlinestrip(tab_sep_line))

Approach 3: Using the answer provided by Jason S

split = str.split
strip = str.strip
def get_fields(tab_sep_line):
    return split(strip(tab_sep_line, "\n"), "\t")

Profiling reports the following cumulative time for get_fields:

Approach 1: 13.027

Approach 2: 16.487

Approach 3: 9.714

So avoiding dots makes a difference, but wrapping the calls in lambdas seems counter-productive.
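For anyone who wants to repeat the comparison outside the profiler, here is a rough timeit sketch. The function names, the sample line and the repeat count are my own assumptions, and absolute numbers will vary with the Python version and machine. The lambda variant adds one extra Python-level call per wrapped method, which is a plausible reason for it being the slowest.

```python
import timeit

split = str.split
strip = str.strip

def fields_dotted(line):
    # Approach 1: plain method calls on the string.
    return line.strip("\n").split("\t")

tabsplit = lambda s: split(s, "\t")
endlinestrip = lambda s: strip(s, "\n")

def fields_lambda(line):
    # Approach 2: each lambda adds one more function call.
    return tabsplit(endlinestrip(line))

def fields_direct(line):
    # Approach 3: call the unbound str methods through module-level names.
    return split(strip(line, "\n"), "\t")

# Hypothetical input: ten tab-separated fields and a trailing newline.
sample = "\t".join(["field"] * 10) + "\n"

def time_all(number=100_000):
    """Return the total time in seconds for `number` calls of each variant."""
    variants = {
        "dotted": fields_dotted,
        "lambda": fields_lambda,
        "direct": fields_direct,
    }
    return {
        name: timeit.timeit("f(sample)", globals={"f": func, "sample": sample}, number=number)
        for name, func in variants.items()
    }

if __name__ == "__main__":
    for name, seconds in time_all().items():
        print(f"{name}: {seconds:.3f} s")
```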


Solution

  • The advice to "avoid dots" for performance (1) is only worth following if you actually have a performance problem, i.e. not when a function is merely called many times but when it actually takes too much time, and (2) cannot be put into practice with partial.

    The reason dots can take more time than locals is that Python has to perform an attribute lookup each time. But if you use partial, there is an extra function call each time, and on every call it also copies and updates a dictionary and concatenates two argument lists. You're not gaining, you're losing.

    However, if you really want you can do:

    strip = str.strip
    split = str.split
    ...
    fields = split(strip(line, '\n'), '\t')
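If a reusable callable is still wanted, the sketch below shows two ways to get the argument order right, plus the local-binding variant wrapped in a function. The helper name read_fields is hypothetical. Note that partial and methodcaller each still add a call layer, so they are unlikely to beat calling str.split and str.strip directly; they fix the error, not the speed.

```python
from functools import partial
from operator import methodcaller

# str.split accepts sep as a keyword, so partial can bind it by name
# and leave the first positional slot free for the string itself.
tabsplit = partial(str.split, sep="\t")

# str.strip takes its argument positionally only, so use methodcaller,
# which calls obj.strip("\n") on whatever object it is given.
endlinestrip = methodcaller("strip", "\n")

def read_fields(lines):
    """Split each tab-separated line into fields (hypothetical helper)."""
    # Binding the methods to local names avoids repeated attribute lookups
    # inside the loop, which is what the "avoid dots" tip is about.
    split = str.split
    strip = str.strip
    return [split(strip(line, "\n"), "\t") for line in lines]
```

Passing an open file object to read_fields works as well, since iterating over a file yields its lines.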