I was running some tests and found out that this piece of code here is inefficient. Looping in range of dates, if self.query
is in df
then appends the line, pretty straight forward. But I heard many opinions, that appending like this isn't efficient and even resource hungry.
My parquets have 4 columns with milions of lines - query
phone_count
desktop_count
total
, dropping 2 cols which means I have index
, query
and total
and then the magic happens.
This code is working "fine", but now I'm looking for opinions from experienced users and possibly getting some hints.
Is there a way of doing the same in more efficient way? Tuples maybe?
Thank you, guys!
for filename in os.listdir(directory):
if filename.endswith(".parquet"):
df = pd.read_parquet(directory).drop(["phone_count","desktop_count"], axis=1)
df.set_index("query", inplace=True)
if self.lowercase == "on":
df.index = df.index.str.lower()
else:
pass
if self.sensitive == "on":
self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query), axis=0))
else:
self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query, re.IGNORECASE), axis=0))
self.datafr = self.datafr.groupby(['query']).sum().sort_values(by='total', ascending=False)
You are repeating a few things with each loop:
DataFrame.append
is slower than pd.concat([frame1, frame2, ...])
list.append
is a lot faster than DataFrame.append
Try this:
option = re.IGNORECASE if self.lowercase == "on" else 0
pattern = re.compile(self.query, option)
subframes = []
for filename in os.listdir(directory):
if filename.endswith(".parquet"):
df = pd.read_parquet(directory).drop(["phone_count","desktop_count"], axis=1)
df.set_index("query", inplace=True)
# Not sure if this statement is necessary. The regex
# is already IGNORECASE when lowercase == "on"
if self.lowercase == "on":
df.index = df.index.str.lower()
# Multiple list.append
subframes.append(df.filter(pattern, axis=0))
# But a single pd.concat
self.datafr = pd.concat([self.datafr] + subframes)