Tags: postgresql, pentaho, throughput

Pentaho Table input gives very poor performance on Postgres tables, even when reading only two columns


A simple Table input step reading from a Postgres table (fetching two of its 20 columns) is taking a very long time. Its output feeds a Stream lookup step, from which I fetch a single column.

Here is the log:

2020/05/15 07:56:03 - load_identifications - Step **Srclkp_Individuals.0** ended successfully, processed 4869591 lines. ( 7632 lines/s)
2020/05/15 07:56:03 - load_identifications - Step LookupIndiv.0 ended successfully, processed 9754378 lines. ( 15288 lines/s)

The table input query is:

SELECT
    id as INDIVIDUAL_ID,
    org_ext_loc 
FROM
    individuals

The table in Postgres has just 20 columns and about 4.8 million rows.

This is Pentaho Data Integration 7.1; server details are below.

**Our server information**:
OS: Oracle Linux 7.3
RAM: 65707 MB
HDD capacity: 2 TB
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 16
CPU MHz: 2294.614

I am connecting to Postgres via JDBC.

I don't know what else I can do to reach about 15K rows/s throughput.
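
One JDBC-side setting worth checking is the fetch size: by default the PostgreSQL driver buffers the entire result set in memory, and rows only stream in batches once a cursor-based fetch is enabled. Below is a minimal Java sketch of such a read; the connection URL, credentials, and the batch size of 10,000 are illustrative assumptions. Pentaho database connections accept extra driver options, so the equivalent pgjdbc parameter (defaultRowFetchSize) can likely be set there.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- adjust for your environment.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // pgjdbc only uses a cursor-based fetch when autocommit is off.
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id AS individual_id, org_ext_loc FROM individuals")) {
                // Stream rows in batches of 10000 instead of buffering all 4.8M rows.
                ps.setFetchSize(10000);
                long count = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        count++; // consume the row (real code would process it)
                    }
                }
                System.out.println("Read " + count + " rows");
            }
        }
    }
}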


Solution

  • When you use a Table input with a Stream lookup, the way Pentaho runs the Stream lookup is slower than using a Database lookup. As @nsousa suggested, I checked with a Dummy step and learned that Pentaho handles each step type differently.

    Even though Database lookup and Stream lookup fall into the same category of steps, the Database lookup performs better in this situation; the sketch below illustrates why.

    The Pentaho documentation offers some ideas and suggestions on this as well.
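
To make the difference concrete, here is a rough Java analogy of the two lookup strategies; it is a sketch, not Pentaho's actual implementation, and the connection details are placeholders. A Stream lookup must read the entire lookup stream (all 4.8 million rows here) into memory before producing any output, while a Database lookup issues one small query per input row and can cache the results; with an index on id, each of those queries is cheap.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class LookupStrategies {
    // Stream-lookup style: load the whole lookup table into memory up front.
    static Map<Long, String> buildInMemoryLookup(Connection conn) throws Exception {
        Map<Long, String> cache = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, org_ext_loc FROM individuals");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                cache.put(rs.getLong(1), rs.getString(2)); // ~4.8M entries before any row flows
            }
        }
        return cache;
    }

    // Database-lookup style: one small indexed query per input row, with a local cache.
    static String lookupById(Connection conn, Map<Long, String> cache, long id) throws Exception {
        String cached = cache.get(id);
        if (cached != null) return cached;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT org_ext_loc FROM individuals WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                String value = rs.next() ? rs.getString(1) : null;
                if (value != null) cache.put(id, value);
                return value;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- adjust for your environment.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            Map<Long, String> cache = new HashMap<>();
            System.out.println(lookupById(conn, cache, 42L)); // single indexed lookup
        }
    }
}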