Tags: sql, oracle, clob, dblink

Optimize INSERT of a LOB column over a dblink


I am trying to insert the values of a table that is accessible via a database link and contains a LOB column. However, performance is pretty bad. I have tried using cursors and bulk collect, but they don't seem to work with LOBs in remote databases. Is there any other way to optimize it?

This is my query. It works without the last two filter conditions, since operations on the LOB column are not allowed when the table is remote, but ideally we would like to include them in the same operation.

-- Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
-- PL/SQL Release 12.1.0.2.0 - Production
-- "CORE    12.1.0.2.0  Production"
-- TNS for Linux: Version 12.1.0.2.0 - Production
-- NLSRTL Version 12.1.0.2.0 - Production


INSERT INTO raw_data (
    "URL",
    tipo,
    fecha_evento,
    fecha_registro,
    usuario,
    cliente,
    contrato,
    referer,
    correlation_id,
    session_id,
    tracking_cookie,
    "JSON",
    "APPLICATION"
)
    SELECT
        r.*
    FROM
        schema.remote_table@dblink r
    WHERE
        fecha_evento BETWEEN trunc(SYSDATE - INTERVAL '1' MONTH, 'MONTH')
                         AND trunc(SYSDATE, 'MONTH') - INTERVAL '1' DAY
        AND r."URL" IN ( 'a', 'b', 'c', 'd' )
        AND NOT JSON_EXISTS ( r."JSON", '$.response.pasofin.codoferta' )
        AND r."URL" NOT IN ( 'e', 'f' );

Solution

  • LOB operations over dblinks have long been an issue, though they've improved in more recent versions. Internally, when you copy a LOB over via INSERT ... SELECT or CTAS, Oracle will, for each row, loop on internal dbms_lob calls (or something underneath them) to incrementally fetch bytes into a buffer and write them to the target LOB segment. Think of it as a nested loop: for each row, do an inner loop as many times as it takes to assemble a LOB (and do it again for each additional LOB in the row). Clearly this is a lot more work than a straight pull of a simple rowset.

    The buffer that each LOB piece is copied into has historically been quite small, resulting in many round trips, so network latency adds up and pulling LOBs over a link can be quite slow. One wishes it were possible to control the buffer size so fewer round trips are necessary, but unless there's an underscore parameter that does this, I don't think we have any control over it. I have noticed that LOB movement has improved in recent versions, but I don't recall whether it had yet in 12.1.

    There are several workarounds to speed this up.

    1. Pull all rows with a LOB length of < 4000 bytes, casting them to varchar2(4000) so they are transferred over the link as varchar2. All dbms_lob calls made while doing this must use the @dblink remote versions, not the local ones. Due to character-set mismatch issues you might need to make this varchar2(2000) or even varchar2(1000). This is very fast. Then pull the longer ones over as LOBs, which hopefully is far fewer rows. If most of your LOBs are less than 4K, this can really speed things up; if most of your LOBs are greater than 4K, however, you get no benefit.
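    A minimal sketch of option 1, reusing the question's raw_data table, dblink name, and "JSON" column (the 4000-byte threshold and the abbreviated column lists are assumptions; remember the character-set caveat above may force a smaller limit):

    ```sql
    -- Pass 1: rows whose CLOB fits in 4000 bytes travel as VARCHAR2,
    -- avoiding the per-row LOB round-trip loop. Note the remote
    -- @dblink versions of dbms_lob, per the caveat above.
    INSERT INTO raw_data ( "URL", tipo, /* ...remaining columns... */ "JSON", "APPLICATION" )
    SELECT r."URL", r.tipo, /* ...remaining columns... */
           TO_CLOB(dbms_lob.substr@dblink(r."JSON", 4000, 1)),
           r."APPLICATION"
    FROM   schema.remote_table@dblink r
    WHERE  dbms_lob.getlength@dblink(r."JSON") <= 4000;

    -- Pass 2: only the long rows pay the slow native LOB transfer.
    INSERT INTO raw_data ( "URL", tipo, /* ...remaining columns... */ "JSON", "APPLICATION" )
    SELECT r."URL", r.tipo, /* ...remaining columns... */ r."JSON", r."APPLICATION"
    FROM   schema.remote_table@dblink r
    WHERE  dbms_lob.getlength@dblink(r."JSON") > 4000;
    ```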

    2. You can create a procedure on the remote database that takes a column name and a table name, reads the LOBs from all rows locally in a determined row order, and converts them into a collection of varchar2(32767) records (this can then be converted to BLOB and a collection of RAW and then lz_compressed, if desired). Return the collection via an OUT parameter, along with another collection of byte offsets showing where each row's LOB starts. The calling database receives the output collections and reverses the process locally, reconstructing the original LOBs and writing them to the target. This is rather complicated, but it works (I've done it successfully) and is a lot faster than native LOB movement over a dblink. Because of its complexity and susceptibility to bugs, however, it isn't an approach I would highly recommend. But it is an option.
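    As a rough illustration of option 2, the remote side might look like the following (the package name lob_ship, the fixed "JSON" column, and the ordering on correlation_id are all hypothetical; a production version would also need to handle multibyte character sets, where 32767 bytes is fewer than 32767 characters):

    ```sql
    -- Created in the REMOTE database, where dbms_lob runs locally and fast.
    CREATE OR REPLACE PACKAGE lob_ship AS
      TYPE t_chunks  IS TABLE OF VARCHAR2(32767) INDEX BY PLS_INTEGER;
      TYPE t_offsets IS TABLE OF PLS_INTEGER     INDEX BY PLS_INTEGER;
      PROCEDURE get_json (p_chunks OUT t_chunks, p_first_chunk OUT t_offsets);
    END lob_ship;
    /
    CREATE OR REPLACE PACKAGE BODY lob_ship AS
      PROCEDURE get_json (p_chunks OUT t_chunks, p_first_chunk OUT t_offsets) IS
        i PLS_INTEGER := 0;  -- running chunk index
        r PLS_INTEGER := 0;  -- running row index
      BEGIN
        FOR rec IN (SELECT "JSON" FROM remote_table ORDER BY correlation_id) LOOP
          r := r + 1;
          p_first_chunk(r) := i + 1;  -- where this row's LOB starts
          FOR j IN 0 .. CEIL(dbms_lob.getlength(rec."JSON") / 32767) - 1 LOOP
            i := i + 1;
            p_chunks(i) := dbms_lob.substr(rec."JSON", 32767, j * 32767 + 1);
          END LOOP;
        END LOOP;
      END get_json;
    END lob_ship;
    /
    -- The local side then calls lob_ship.get_json@dblink(...), walks the two
    -- collections row by row, and reassembles each CLOB with dbms_lob.writeappend.
    ```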

    3. Use multiple processes (e.g. via dbms_scheduler) to break the table extract into roughly equal portions and tackle the problem with brute force. Reassemble the per-thread work tables (or better, partitions of one table) at the end to collect your final result. Ten threads will move those LOBs almost 10x faster than a single session can (you do have the overhead of writing the data twice, but that's very small compared to the benefit of parallelizing the network pull). This can be done either by itself or in combination with the techniques mentioned above.
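    Option 3 can be sketched roughly as follows (the ten staging tables raw_data_part_0..9 and the ora_hash split on correlation_id are assumptions; any reasonably uniform key would do):

    ```sql
    -- Launch 10 scheduler jobs, each pulling one hash bucket of the remote rows.
    BEGIN
      FOR i IN 0 .. 9 LOOP
        dbms_scheduler.create_job(
          job_name   => 'PULL_LOBS_' || i,
          job_type   => 'PLSQL_BLOCK',
          job_action =>
            'INSERT INTO raw_data_part_' || i || '
               SELECT r.* FROM schema.remote_table@dblink r
               WHERE ora_hash(r.correlation_id, 9) = ' || i || ';
             COMMIT;',
          enabled    => TRUE);
      END LOOP;
    END;
    /
    -- Once all jobs have finished, reassemble the staging tables, e.g.:
    -- INSERT INTO raw_data SELECT * FROM raw_data_part_0
    -- UNION ALL SELECT * FROM raw_data_part_1 ... ;
    ```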