Profiling my code shows that the `split` and `strip` methods of `str` objects are among the most frequently called functions.
It happens that I use constructs such as:

    with open(filename, "r") as my_file:
        for line in my_file:
            fields = line.strip("\n").split("\t")

And some of the files to which this is applied have a lot of lines.
So I tried using the "avoid dots" advice in https://wiki.python.org/moin/PythonSpeed/PerformanceTips as follows:

    from functools import partial

    split = str.split
    tabsplit = partial(split, "\t")
    strip = str.strip
    endlinestrip = partial(strip, "\n")

    def get_fields(tab_sep_line):
        return tabsplit(endlinestrip(tab_sep_line))

    with open(filename, "r") as my_file:
        for line in my_file:
            fields = get_fields(line)

However, this gave me a `ValueError: empty separator` for the `return` line of my `get_fields` function.
After investigating, what I understand is that the separator for the `split` method is the second positional argument, the first being the string object itself. `functools.partial` therefore treated `"\t"` as the string to be split, and the separator passed to it was the result of `"\n".strip(tab_sep_line)`, which is an empty string whenever the line contains a newline. Hence the error.
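The mechanism can be reproduced in isolation. This is a minimal sketch with a made-up sample line (`"a\tb\tc\n"` is only an illustration, not data from my files):

```python
from functools import partial

# partial prepends its stored arguments, so "\t" becomes the FIRST
# positional argument of str.split, i.e. the string being split.
tabsplit = partial(str.split, "\t")
endlinestrip = partial(str.strip, "\n")

line = "a\tb\tc\n"  # hypothetical sample line

# Equivalent to "\n".strip(line): the only character of "\n" occurs in
# line, so it is stripped away and the result is the empty string.
stripped = endlinestrip(line)

# Equivalent to "\t".split(""): the separator is empty, hence the error.
error_message = None
try:
    tabsplit(stripped)
except ValueError as exc:
    error_message = str(exc)

print(repr(stripped), error_message)
```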
What would you suggest doing instead?
Edit:
I tried to compare three ways to implement the `get_fields` function.
Approach 1: Using plain `.strip` and `.split`

    def get_fields(tab_sep_line):
        return tab_sep_line.strip("\n").split("\t")

Approach 2: Using lambda

    split = str.split
    strip = str.strip
    tabsplit = lambda s: split(s, "\t")
    endlinestrip = lambda s: strip(s, "\n")

    def get_fields(tab_sep_line):
        return tabsplit(endlinestrip(tab_sep_line))

Approach 3: Using the answer provided by Jason S

    split = str.split
    strip = str.strip

    def get_fields(tab_sep_line):
        return split(strip(tab_sep_line, "\n"), "\t")

Profiling reports the following cumulative time for `get_fields`:
Approach 1: 13.027
Approach 2: 16.487
Approach 3: 9.714
So avoiding dots makes a difference, but using `lambda` seems counterproductive.
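For anyone who wants to repeat the comparison without a full profiler run, here is a rough `timeit` sketch. The function names `fields_dots`, `fields_lambda` and `fields_unbound` and the sample line are invented for this illustration, and the timings will differ between machines and interpreter versions, so the ranking above is not guaranteed to reproduce.

```python
import timeit

split = str.split
strip = str.strip

def fields_dots(tab_sep_line):
    # Approach 1: ordinary method calls
    return tab_sep_line.strip("\n").split("\t")

tabsplit = lambda s: split(s, "\t")
endlinestrip = lambda s: strip(s, "\n")

def fields_lambda(tab_sep_line):
    # Approach 2: lambdas wrapping the unbound methods
    return tabsplit(endlinestrip(tab_sep_line))

def fields_unbound(tab_sep_line):
    # Approach 3: unbound methods called directly
    return split(strip(tab_sep_line, "\n"), "\t")

sample = "chr1\t100\t200\tgene\n"  # made-up line, for illustration only

timings = {}
for func in (fields_dots, fields_lambda, fields_unbound):
    # All three variants must agree before their speed is compared.
    assert func(sample) == ["chr1", "100", "200", "gene"]
    timings[func.__name__] = timeit.timeit(lambda: func(sample), number=100_000)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```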
Two things about the "avoid dots" advice: (1) it is only worth following if you actually have a performance problem, i.e. not when a function is merely called many times but when it actually takes too much time, and (2) using `partial` is not the way to follow it.
The reason dots can take more time than locals is that Python has to perform an attribute lookup each time. But if you use `partial`, then there's an extra function call each time, and that call also copies and updates a keyword dictionary and concatenates the stored and call-time positional arguments. You're not gaining, you're losing.
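To make that overhead concrete, here is a rough pure-Python sketch of what a `partial`-like wrapper does on every call. The name `sketch_partial` is invented for this illustration. The real `functools.partial` is implemented in C in CPython, so its actual cost is lower than this sketch suggests, but the steps are of the same kind.

```python
def sketch_partial(func, *args, **keywords):
    """Hypothetical, simplified stand-in for functools.partial."""
    def newfunc(*fargs, **fkeywords):
        # Per-call work: copy and update the keyword dict...
        newkeywords = keywords.copy()
        newkeywords.update(fkeywords)
        # ...then concatenate stored and call-time positional arguments.
        # The stored ones come FIRST, which is why "\t" ends up as the
        # string being split rather than as the separator.
        return func(*(args + fargs), **newkeywords)
    return newfunc

tabsplit = sketch_partial(str.split, "\t")
# Same as str.split("\t", "a\tb"), i.e. "\t".split("a\tb")
result = tabsplit("a\tb")

# Keyword arguments are bound without this ordering problem.
parse_binary = sketch_partial(int, base=2)
value = parse_binary("101")

print(result, value)
```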
However, if you really want to, you can do:

    strip = str.strip
    split = str.split
    ...
    # Pass '\n' explicitly: a bare strip(line) would also remove leading and trailing tabs.
    fields = split(strip(line, '\n'), '\t')
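One way to apply this to a loop over a file is to bind the two methods to local names once, outside the loop. This is only a sketch: the generator name `iter_fields` is made up, and an `io.StringIO` object stands in for the real file so the example runs on its own.

```python
import io

def iter_fields(lines):
    """Yield the tab-separated fields of each line (hypothetical helper)."""
    # Look the methods up once and keep them in fast local variables.
    strip = str.strip
    split = str.split
    for line in lines:
        # Strip only the newline so that empty leading/trailing fields survive.
        yield split(strip(line, "\n"), "\t")

# Stand-in for `open(filename, "r")`, with made-up content.
fake_file = io.StringIO("a\tb\tc\n1\t\t3\n")
rows = list(iter_fields(fake_file))
print(rows)
```

Whether this is measurably faster than the plain `line.strip("\n").split("\t")` depends on the interpreter version, so it is worth profiling before and after.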