In postgreSQL 9.5:
I have a foreign table named: sheetheight (created by file_fdw) and a foreign table named: dzlog (created by postgres_fdw).
1- For joining foreign tables I have following query:
SELECT * from dzlog INNER JOIN sheetheight ON dzlog.ullid = sheetheight.ullid
;
and the EXPLAIN ANALYZE
returns back this for the above query:
-------------------------------------------------
Hash Join (cost=111.66..13688.18 rows=20814 width=2180) (actual time=7670.872.
.8527.844 rows=2499 loops=1)
Hash Cond: (sheetheight.ullid = dzlog.ullid)
-> Foreign Scan on sheetheight (cost=0.00..12968.10 rows=106741 width=150)
(actual time=0.116..570.571 rows=223986 loops=1)
Foreign File: D:\code\sources\sheetHeight_20151025_221244_0000000004987
6878996.csv
Foreign File Size: 18786370
-> Hash (cost=111.17..111.17 rows=39 width=2030) (actual time=7658.661..765
8.661 rows=34107 loops=1)
Buckets: 2048 (originally 1024) Batches: 32 (originally 1) Memory Usa
ge: 4082kB
-> Foreign Scan on dzlog (cost=100.00..111.17 rows=39 width=2030) (ac
tual time=47.162..7578.990 rows=34107 loops=1)
Planning time: 8.755 ms
Execution time: 8530.917 ms
(10 rows)
The output of query have two columns named ullid.
ullid,date,color,sheetid,dz0,dz1,dz2,dz3,dz4,dz5,dz6,dz7,ullid,sheetid,pass,...
2- For direct access to csv file and sql local table from python application, I have:
I have done the same query by not using the FDWs but the direct access to the csv file and postgreSQL local table from a python application using Pandas merge dataframe
. This join is raw join , so I first fetch the csv file and then fetch the sql table by using pandas library from python, then I merge the two dataframes based on the common columns
import pandas as pd
def rawjoin(query,connection=psycopg2.connect("dbname='mydb' user='qfsa' host='localhost' password='123' port=5433")):
query=("SELECT * FROM dzlog;")
firstTable= pd.read_csv('.\sources\sheetHeight_20151025_221244_000000000498768789.csv', delimiter=';', header=0)
secondTable =pd.read_sql(query,connection)
merged= pd.merge(firstTable, secondTable, on= 'ullid', how='inner')
return merged
The result is the joined data frame with one ullid column.
Any idea for this difference? I did other types of joins and the result from RAW access and FDW access are the same, other queries are as following:
q7=("SELECT dzlog.color FROM dzlog,sheetheight WHERE dzlog.ullid = sheetheight.ullid;")
q8=("SELECT sheetheight.defectfound FROM dzlog, sheetheight WHERE dzlog.ullid = sheetheight.ullid;")
q9=("SELECT dzlog.color, sheetheight.defectfound FROM dzlog, sheetheight WHERE dzlog.ullid= sheetheight.ullid;")
I don't know what your second example does, so it's hard to tell. Which library is used? Does it generate SQL or is the join performed in the application (which is almost always a performance loss)? If this results in an SQL statement, what is the statement?
The first query returns the column twice, because you asked it to return all columns from all involved tables, and both tables have this column, which the join condition forces to be equal.
You can write an SQL statement that will output the column only once like this:
SELECT *
FROM dzlog
JOIN sheetheight
USING (ullid);
This looks suspiciously like the code in your second example, doesn't it?