Search code examples
pythonpandaspandasql

Pandasql query of a list or dict of dataframes


Given a set of files - each one of which is read into a distinct dataframe - how might a pandasql query reference them?

In the following snippet we have a list of dataframes: but the same question would apply to a dict:

import pandas as pd
from  pandasql import sqldf
# Read in a set of 10 files each containing columns `id` and `estimate`
dfs = [pd.read_csv('file%d.csv' %d) for d in range(1,10+1)]
sql_res = sqldf("select d2.estimate - d1.estimate \
    from dfs[1] d1 join dfs[2] d2 on d2.id = d1.id", locals())

The dfs[1] and dfs[2] are showing what I'd like to do - but are not valid syntax. Any suggestions on how to structure this kind of problem in a way that pandasql can support?


Solution

  • You can tell pandasql a list of table names/aliases instead of just passing locals(), as per the docstring of PandasSQL.__call__ (can't find online version of the docs):

    :param env: Variables environment - a dict mapping table names to pandas dataframes.
    

    Note that you must put all tables that you want to query there, though.

    Here is a small example, using the PandasSQL class instead of sqldf as recommended in the docstring:

    import pandasql
    
    sql = pandasql.PandaSQL()
    
    
    data_frames = [
        pd.DataFrame({'a': [1,2,3]}),
        pd.DataFrame({'b': [1,4,5]})
    ]
    
    # create aliases for your dataframes
    env = {'t%d' % (i + 1): df for i, df in enumerate(data_frames)}
    
    # tell pandasql to use the aliases
    print(sql('select t1.* from t1 join t2 on t1.a = t2.b', env=env))