Search code examples
pythonsqlpostgresqlprocedureplpython

PostgreSQL Procedure using python write dataframe to table


I am developing a PostgreSQL Procedure using the plpython3u extension which let you use python procedural language into PostgreSQL.

With the following code using plpy, I am able to retrive data form table and put it into pandas dataframe.

CREATE OR REPLACE PROCEDURE public.plpy_proc_clas_full(
    )
LANGUAGE 'plpython3u'
AS $BODY$
  import pandas as pd
  
  data_lt = plpy.execute('SELECT "key", "value" FROM public."<your-table>" ORDER BY "key"'); #PLyResult --> List or Dictionary
  data_df_x = pd.DataFrame.from_records(data_lt)['key'];
  data_df_y = pd.DataFrame.from_records(data_lt)['value'];
  df = pd.concat([data_df_x, data_df_y], axis=1).values
  
  return df;
$BODY$;

But how can I write back the pandas dataframe to a table (for example after a few data manipulations in python)?


Solution

  • After some time, I read a few articles which were using libraries like psycopg2 or plpy.connect to open a connection to destination database.

    Personally, I do not find any reason to re-open a connection while manipulating data (in and out) within the same runtime/database.

    I thought of a solution which makes the job, although not sure about performance while handling huge tables because of using an iteration into the dataframe which case multiple insert statements, more specific one insert per line into dataframe, which is not bulk.

    CREATE OR REPLACE PROCEDURE public.plpy_proc_test(
        )
    LANGUAGE 'plpython3u'
    AS $BODY$
      import pandas as pd
      
      data_lt = plpy.execute('SELECT "key", "value" FROM public."ml_train" ORDER BY "key"'); #PLyResult --> List or Dictionary
      data_df_x = pd.DataFrame.from_records(data_lt)['key'];
      data_df_y = pd.DataFrame.from_records(data_lt)['value'];
      
      df = pd.concat([data_df_x, data_df_y], axis=1)
      
      for index, row in df.iterrows():
        plan = plpy.prepare('INSERT INTO test (y_hat, y_act) VALUES ($1, $2)', ['numeric', 'numeric'])
        plpy.execute(plan, [row[0], row[1]])
    $BODY$;