
Is using a Pandas Dataframe as a read-only table scalable in a Flask App?


I'm developing a small website in Flask that relies on data from a CSV file to populate a table on the frontend using jQuery.

The user would select an ID from a drop-down on the front-end, then a function on the back-end would use that ID as a filter on the table and return the matching data. The returned data would usually be just a single column from the dataframe.

The usual approach, as I understand it, would be to load the CSV data into a SQLite DB on startup and query it with SQL from Python at runtime.
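For context, I believe that approach would look roughly like the sketch below (the table name query_table is just illustrative):

import sqlite3

import pandas as pd

# One-time setup at startup: load the CSV into an in-memory SQLite DB.
# check_same_thread=False lets Flask's worker threads share the connection.
conn = sqlite3.connect(':memory:', check_same_thread=False)
pd.read_csv('query_file.csv').to_sql('query_table', conn, index=False)

# At request time: filter with a parameterized SQL query
def get_names(record_id):
    cur = conn.execute(
        "SELECT DISTINCT Name FROM query_table WHERE ID = ? ORDER BY Name",
        (record_id,),
    )
    return [row[0] for row in cur.fetchall()]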

However, in my case the table is 15 MB in size (214K rows) and will never grow past that point. All the data stays as-is for the duration of the app's lifecycle.

As such, would it be easier and less hassle to just load the dataframe into memory and filter a copy of it when requests come in? Is that scalable, or am I just kicking the can down the road?

Example:

import os

import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

dir_path = os.path.abspath(os.path.dirname(__file__))

# Load the CSV into an in-memory dataframe once, on startup
with app.app_context():
    print("Loading CSV on startup")

    query_df = pd.read_csv(dir_path + '/query_file.csv')


@app.route('/getData', methods=["POST"])
def get_data():

    id = request.get_json()

    print("Getting rows....")

    # Filter the in-memory dataframe by the requested ID
    data_list = sorted(set(query_df[query_df['ID'] == id]['Name'].tolist()))

    return jsonify({'items': data_list, 'ID': id})

This may be a tad naive on my end, but I could not find a straight answer for my particular use case.


Solution

  • This line of code can be made much faster without adding any new dependencies, just by using the tools that Pandas gives you.

    data_list = sorted(set(query_df[query_df['ID'] == id]['Name'].tolist()))
    

    The following optimizations can be made:

    • sorted() can be replaced by pre-sorting the dataframe.
    • set() can be replaced by dropping duplicates with the same ID and Name.
    • query_df[query_df['ID'] == id] requires searching the entire dataframe for matching ID values, and can be replaced with an index.

    To prepare the dataframe, at startup, right after reading it with read_csv(), you would do the following:

    # Pre-sort, de-duplicate, and index by ID, so each request-time
    # lookup no longer needs sorted(), set(), or a full-column scan
    name_lookup_series = (
        query_df
        .sort_values(['ID', 'Name'])
        .drop_duplicates(['ID', 'Name'])
        .set_index('ID')['Name']
    )
    

    To look up any particular ID, you would do the following (the double brackets keep the result a Series even when only one row matches, and the lookup goes through the index instead of scanning the whole column):

    name_lookup_series.loc[[id_to_look_up]].tolist()
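    One caveat to be aware of: .loc raises a KeyError when the requested ID is absent from the index, so a web handler would likely want to guard the lookup. A minimal sketch (the empty-list fallback is my own assumption, not part of the original answer):

    def look_up_names(record_id):
        # .loc with a list keeps the result a Series even for a single
        # match, but raises KeyError for IDs absent from the index
        try:
            return name_lookup_series.loc[[record_id]].tolist()
        except KeyError:
            return []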
    

    Benchmarking this, it is roughly 100x faster. The benchmark program below uses %timeit, so it needs to be run in IPython or Jupyter:

    import pandas as pd
    import numpy as np
    np.random.seed(92034)
    
    N = 200000
    
    df = pd.DataFrame({
        'ID': np.random.randint(0, N, size=N),
        'Name': np.random.randint(0, N, size=N),
    })
    df['ID'] = 'ID' + df['ID'].astype('str')
    df['Name'] = 'N' + df['Name'].astype('str')
    
    print("Test dataframe")
    print(df)
    id_to_look_up = np.random.choice(df['ID'])
    print("Looking up", id_to_look_up)
    print("Result, method 1", sorted(set(df[df['ID'] == id_to_look_up]['Name'].tolist())))
    
    %timeit sorted(set(df[df['ID'] == id_to_look_up]['Name'].tolist()))
    name_lookup_series = (
        df.copy()
        .sort_values(['ID', 'Name'])
        .drop_duplicates(['ID', 'Name'])
        .set_index('ID')['Name']
    )
    print("Result, method 2", name_lookup_series.loc[[id_to_look_up]].tolist())
    %timeit name_lookup_series.loc[[id_to_look_up]].tolist()
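
    Putting it together, the Flask route from the question could then become something like the sketch below (the route and payload shape follow the question's example; the empty-list fallback for unknown IDs is my own choice):

    @app.route('/getData', methods=["POST"])
    def get_data():
        record_id = request.get_json()
        try:
            # Index lookup on the pre-sorted, de-duplicated Series;
            # no sorted()/set() needed at request time
            data_list = name_lookup_series.loc[[record_id]].tolist()
        except KeyError:
            data_list = []  # unknown ID
        return jsonify({'items': data_list, 'ID': record_id})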