Tags: sql, database, pentaho, data-warehouse, pdi

Data insertion takes too long in Pentaho (PDI)


I'm trying to insert 500k+ rows into a table, but it takes too long. The speed is around 1000 rows/s at the beginning, but it quickly drops to about 100 rows/s after that.

Here is the query I'm executing in my transformation:

SELECT 
    accident_index,
    longitude,
    latitude,
    accident_severity as accident_severity_fk,
    number_of_vehicles,
    number_of_casualties,
    EXTRACT('year' from formatted_date)*10000 + EXTRACT('month' from formatted_date)*100 + EXTRACT('day' from formatted_date) as date_fk,
    r.road_information_pk as road_information_fk,
    speed_limit,
    p.pedestrian_crossing_information_pk as pedestrian_crossing_information_fk,
    l.light_conditions_pk as light_conditions_fk,
    w.weather_conditions_pk as weather_conditions_fk,
    rs.road_surface_conditions_pk as road_surface_conditions_fk,
    s.special_conditions_at_site_pk as special_conditions_at_site_fk,
    c.carriageway_hazards_pk as carriageway_hazards_fk,
    urban_or_rural_area as urban_or_rural_area_fk,
    time
FROM staging.accidents ac
left join core.road_information r on r.road_type = ac.road_type and r.junction_control = coalesce(ac.junction_control,'Unknown')
left join core.pedestrian_crossing_information p on p.pedestrian_crossing_human_control = coalesce(ac.pedestrian_crossing_human_control,'Unknown')
    and p.pedestrian_crossing_physical_facilities = coalesce(ac.pedestrian_crossing_physical_facilities,'Unknown')
left join core.light_conditions as l on l.light_conditions = coalesce(ac.light_conditions,'Unknown')
left join core.weather_conditions as w on w.weather_conditions = coalesce(ac.weather_conditions,'Unknown')
left join core.road_surface_conditions as rs on rs.road_surface_conditions = coalesce(ac.road_surface_conditions,'Unknown')
left join core.special_conditions_at_site as s on s.special_conditions_at_site = coalesce(ac.special_conditions_at_site,'Unknown')
left join core.carriageway_hazards as c on c.carriageway_hazards = coalesce(ac.carriageway_hazards,'Unknown')
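
For reference, the date_fk expression builds a yyyymmdd smart key: formatted_date = 2017-03-05 gives 2017*10000 + 3*100 + 5 = 20170305. On a PostgreSQL backend (which the EXTRACT('year' from ...) syntax suggests, though that's an assumption), the same key can be written more compactly:

-- Equivalent yyyymmdd smart key, assuming PostgreSQL:
SELECT to_char(formatted_date, 'YYYYMMDD')::int AS date_fk
FROM staging.accidents;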

I'm using an Insert/Update step. For testing purposes, I tried inserting only a couple of rows and avoiding the joins, but it runs even a bit slower. Any idea how I can improve the speed?


Solution

  • Your speed problem is more likely to be on the write side than the read side. The best way to check this is to disable the hop coming into the Insert/Update step and run the transformation. It should then run at several thousand or even tens of thousands of rows per second. If that's the case, optimising your read query is pointless; the bottleneck is on the output.

    Insert/Update is notoriously slow, as it needs a round trip to look up each row and decide whether to insert or update it. If the target table is initially empty, there's no need for Insert/Update at all: just sort the rows on the lookup keys, pick the values you want for each attribute, and use a Table output step.

    If you need to update the table, there are different strategies to speed that up, depending on how much of the dataset is new vs existing:

    • If only a few rows need updating and most are inserted, you can use a Table output step to insert them all, rely on a DB constraint to reject the duplicates (see the constraint sketch after this list), and add an Update step after the Table output step, set up as its “error handling” hop. PDI will try to insert all rows; the few that fail are sent down the error-handling hop and updated there. The advantage is that only actual updates use the Update step, so the number of round trips drops to almost zero. In this scenario the “Skip lookup” checkbox should be selected: there's no point doing the lookup when you already know the row exists, just update it.
    • If the ratio of updates to inserts is higher, you can read from the target table, do a Merge rows (diff), and then a Switch/Case to route new rows to a Table output step, deleted rows to a Delete step, and changed rows to an Update step (with the “Skip lookup” box checked), while identical rows are discarded. A pure-SQL alternative for same-database loads is sketched below.
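
    For the constraint-based approach, the target table needs a unique constraint (or primary key) on the business key, so that duplicate inserts fail and get routed down the error-handling hop. A minimal sketch, assuming the fact table is called core.fact_accidents and accident_index is its natural key (both names are guesses, adjust to your schema):

    -- Duplicate inserts now violate the constraint and are sent to the
    -- error-handling hop, where the Update step picks them up.
    ALTER TABLE core.fact_accidents
        ADD CONSTRAINT uq_fact_accidents_accident_index
        UNIQUE (accident_index);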
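
    If staging and core live in the same PostgreSQL database, the insert-or-update decision can also be pushed into the database entirely with INSERT ... ON CONFLICT, which removes the per-row round trips from PDI altogether. This is a sketch rather than the Merge rows (diff) flow above; it reuses the hypothetical table and key from the previous example and abbreviates the column lists to a few fields:

    -- Pure-SQL upsert (PostgreSQL 9.5+); extend the column lists to
    -- match the full SELECT from the question.
    INSERT INTO core.fact_accidents (accident_index, longitude, latitude, speed_limit)
    SELECT accident_index, longitude, latitude, speed_limit
    FROM staging.accidents
    ON CONFLICT (accident_index) DO UPDATE
    SET longitude   = EXCLUDED.longitude,
        latitude    = EXCLUDED.latitude,
        speed_limit = EXCLUDED.speed_limit;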