Search code examples
pythonpandasdataframeappendconcatenation

More efficient way of appending dataframe


I was running some tests and found out that this piece of code here is inefficient. Looping in range of dates, if self.query is in df then appends the line, pretty straight forward. But I heard many opinions, that appending like this isn't efficient and even resource hungry.
My parquets have 4 columns with milions of lines - query phone_count desktop_count total, dropping 2 cols which means I have index, query and total and then the magic happens.

This code is working "fine", but now I'm looking for opinions from experienced users and possibly getting some hints.

Is there a way of doing the same in more efficient way? Tuples maybe?

Thank you, guys!

    for filename in os.listdir(directory):
        if filename.endswith(".parquet"):
            df = pd.read_parquet(directory).drop(["phone_count","desktop_count"], axis=1)
            df.set_index("query", inplace=True)

            if self.lowercase == "on":
                df.index = df.index.str.lower()
            else:
                pass
            if self.sensitive == "on":                            
                self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query), axis=0))
            else:            
                self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query, re.IGNORECASE), axis=0))            


self.datafr = self.datafr.groupby(['query']).sum().sort_values(by='total', ascending=False)

Solution

  • You are repeating a few things with each loop:

    • The regex pattern does not need recompiling every time
    • Repeated DataFrame.append is slower than pd.concat([frame1, frame2, ...])
    • list.append is a lot faster than DataFrame.append

    Try this:

    option = re.IGNORECASE if self.lowercase == "on" else 0
    pattern = re.compile(self.query, option)
    subframes = []
    
    for filename in os.listdir(directory):
        if filename.endswith(".parquet"):
            df = pd.read_parquet(directory).drop(["phone_count","desktop_count"], axis=1)
            df.set_index("query", inplace=True)
    
            # Not sure if this statement is necessary. The regex
            # is already IGNORECASE when lowercase == "on"
            if self.lowercase == "on":
                df.index = df.index.str.lower()
    
            # Multiple list.append
            subframes.append(df.filter(pattern, axis=0))
    
    # But a single pd.concat
    self.datafr = pd.concat([self.datafr] + subframes)