
What are faster ways of reading a big data set and applying row-wise operations, other than pandas and dask?


I am working on code where I need to populate a set of data structures based on each row of a big table. Right now, I am using pandas to read the data and do some elementary validation as a preprocessing step. However, the rest of the process, where I put the data into the corresponding data structures, takes considerably long: the loop over the rows dominates the runtime. For example, in the following code I have a table with 15M records. The table has three columns, and I create a foo() object based on each row and append it to a list.

# Profile.csv
# Index    | Name | Family | BD
# ---------|------|--------|-----------
# 0        | Jane | Doe    | 08/23/1977
# ...
# 15000000 | John | Doe    | 01/01/2000

class foo:
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd

def populate(row, my_list):
    # build a foo() from the row values and collect it
    my_list.append(foo(*row))

# reading the csv file and formatting the date column
import pandas as pd

df = pd.read_csv('Profile.csv')
df['BD'] = pd.to_datetime(df['BD'], format='%m/%d/%Y')

# using apply to create a foo() object per row and add it to the list
my_list = []
df.apply(populate, axis=1, args=(my_list,))

So after using pandas to convert the string dates to date objects, I just need to iterate over the DataFrame to create my objects and add them to the list. This process is very time-consuming (in my real example it takes even longer, since my data structure is more complex and I have more columns). So I am wondering what the best practice is in this case to improve my runtime. Should I even use pandas to read my big tables and process them row by row?


Solution

  • There are multiple approaches for this kind of situation; however, the fastest and most effective method is vectorization, when it is possible. For the example demonstrated in this post, a solution using vectorization could be as follows:

    my_list = [foo(*args) for args in zip(df["Name"], df["Family"], df["BD"])]
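
    If an explicit per-row loop is still needed, itertuples() is a commonly used middle ground: it yields lightweight namedtuples and is typically much faster than DataFrame.apply. A minimal sketch, assuming the same column names as above:

    my_list = [foo(row.Name, row.Family, row.BD) for row in df.itertuples(index=False)]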
    

    If vectorization is not possible, converting the DataFrame to a dictionary can still significantly improve performance. For the current example it would be something like:

    my_list = []
    dc = df.to_dict()
    # df.to_dict() maps each column name to a {row_index: value} dict,
    # so iterate over the row indices of one of the columns
    for i in dc["Name"]:
        my_list.append(foo(dc["Name"][i], dc["Family"][i], dc["BD"][i]))
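
    A variant of the same idea uses to_dict("records"), which returns one plain dict per row so no index bookkeeping is needed; a minimal sketch under the same assumptions:

    my_list = [foo(rec["Name"], rec["Family"], rec["BD"]) for rec in df.to_dict("records")]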
    

    This dictionary-based solution is particularly effective when the data structures and the per-row processing are more complex.
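
    Finally, on whether pandas should be used to read such a big table at all: a commonly used option is to read the CSV in chunks via the chunksize argument of pd.read_csv, so the whole 15M-row table never has to sit in memory at once. A minimal sketch, reusing the column names from the question:

    import pandas as pd

    my_list = []
    # read_csv with chunksize returns an iterator of DataFrame chunks
    for chunk in pd.read_csv("Profile.csv", chunksize=1_000_000):
        chunk["BD"] = pd.to_datetime(chunk["BD"], format="%m/%d/%Y")
        my_list.extend(foo(*args) for args in zip(chunk["Name"], chunk["Family"], chunk["BD"]))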