Tags: oracle-database, parallel-processing

Oracle - select sorted rows with filter and export to CSV: how to execute the job in parallel? start_id/end_id are page offsets, not columns


I want to filter, sort and dump data from a big table to a csv. Oracle 12c.

  • unload filters but does not sort, and I need a stable sort order because I have to compare the data with files from other sources. It also does not split the output into chunked files. Of course, if someone knows how to do this with unload, please tell.
  • dp is the same as unload, right?
  • expdp does not exist in the db; is it not installed?
  • I am exploring the DBMS_PARALLEL_EXECUTE package but cannot find how to do this. The reason is simple: I want to chunk the rows by page, not by any column.

The main query I want to execute is:

select FIELD1, FIELD2
      from DATA
      where COMPANY_ID like :regex
      order by FIELD1   -- indexed and unique
      offset :start_id rows fetch next :page_size rows only;

I only want 2 columns.

I think I cannot use CREATE_CHUNKS_BY_NUMBER_COL or CREATE_CHUNKS_BY_ROWID because:

This procedure chunks the table (associated with the specified task) by the specified column. The specified column must be a NUMBER column. This procedure takes the MIN and MAX value of the column, and then divides the range evenly according to chunk_size.

  • and this table does not use physical rowids.

First question: can I use the PK? The PK is a NUMBER and continuous, but filtering by COMPANY_ID makes it non-continuous. So, for example, from ID 0 to 500 I may only need 2 rows; does that break the chunking? Will the chunk fetch the 2 matching rows for me, or all 500?

Second question: I am following https://oracle-base.com/articles/11g/dbms_parallel_execute_11gR2#user_defined_framework but I don't understand when a chunk is executed and scheduled.

My code:

create or replace procedure compare (
  dir varchar2, 
  regex varchar2, 
  start_id integer, 
  end_id integer, 
  page_size integer
) as
   CURSOR c_query is 
      select FIELD1, FIELD2
      from DATA
      where COMPANY_ID like regex
      order by FIELD1
      offset start_id rows fetch next page_size rows only;
  output   utl_file.file_type;
  filename varchar2(100);
begin
  filename := 'split_' || start_id || '_' || end_id || '.csv';
  output := utl_file.fopen ( dir, filename, 'w' );
  utl_file.put_line ( output, 'FIELD1,FIELD2' );
  
  for row in c_query loop
    utl_file.put_line ( output, row.FIELD1 || ',' || row.FIELD2 );
  end loop;
  utl_file.fclose ( output );
end compare;
/
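For reference, a single page can be dumped by calling the procedure directly; the directory and bind values below are illustrative, not from the original post:

```sql
-- Writes split_0_50.csv into DATA_PUMP_DIR with the first 50 rows
-- (offset 0) matching COMPANY_ID like '90%' (values are illustrative).
begin
  compare('DATA_PUMP_DIR', '90%', 0, 50, 50);
end;
/
```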


create or replace procedure loop_comparison(
    l_task     VARCHAR2 := 'test_task',
    regex VARCHAR2 := '90%',
    page_size  NUMBER := 50,
    dir varchar2 := 'DATA_PUMP_DIR'
)
  as
    l_sql_stmt VARCHAR2(32767);
    l_sql_step_stmt VARCHAR2(32767);
    row_count  NUMBER;
    p_start    NUMBER := 0;
    p_end      NUMBER := 0;
BEGIN
  DBMS_PARALLEL_EXECUTE.create_task (task_name => l_task);

  select count(ID) into row_count from DATA where COMPANY_ID like regex;
  
  p_end := page_size;
  
  l_sql_stmt := 'BEGIN compare(''' || dir || ''',''' || regex || ''', :start_id, :end_id, ' || page_size || '); END;';
  WHILE p_start < row_count
  LOOP
    l_sql_step_stmt := 'select ' || p_start || ' as start_id, ' || p_end || ' as end_id from dual';
    DBMS_PARALLEL_EXECUTE.create_chunks_by_sql(task_name  => l_task,
                                                sql_stmt    => l_sql_step_stmt,
                                                by_rowid    => FALSE);

    DBMS_PARALLEL_EXECUTE.run_task(task_name      => l_task,
                                   sql_stmt       => l_sql_stmt,
                                   language_flag  => DBMS_SQL.NATIVE,
                                   parallel_level => 20,
                                   job_class      => 'test_task_class');
    DBMS_OUTPUT.put_line(
           TO_CHAR(SYSTIMESTAMP, 'yyyy-mm-dd hh24:mi:ss.ff')
        || '  '
        || DBMS_PARALLEL_EXECUTE.task_status(l_task)
    );
    p_start := p_start + page_size;
    p_end := p_end + page_size;
  END LOOP;
end loop_comparison;
/

After exec loop_comparison, I find this:

SQL> select task_name, chunk_type, status from USER_PARALLEL_EXECUTE_TASKS;

   TASK_NAME    CHUNK_TYPE     STATUS
____________ _____________ __________
test_task    UNDECLARED    CREATED

SQL> select * from USER_PARALLEL_EXECUTE_CHUNKS;

no rows selected

I suppose it's never running.


Solution

  • If you are doing an ORDER BY, then you will be passing over the entire set of rows that match your criteria anyway, so it's unlikely you'll see any massive benefit from splitting this up, unless the dominant cost is the sorting element.

    If that is the case, I'd tackle that instead, e.g. by bumping up the PGA or using faster TEMP tablespace I/O.

    'dp' is Data Pump, so it is not appropriate for CSV output.

    Because you are sorting by FIELD1, if you know some upper/lower limits for FIELD1, you could use those ranges for chunking and sorting, because by definition each chunk will contain mutually exclusive data. You'd probably end up with differing file sizes.

    If you want to use DBMS_PARALLEL_EXECUTE to chunk on strings, you can have a lookup table, eg

    ID  LO   HI
    --  ---  ---
    1   AAA  FZZ
    2   GAA  QZZ
    3   RAA  ZZZ
    

    and then create three chunks with CREATE_CHUNKS_BY_SQL, loading the value ranges 1-1, 2-2, 3-3 as chunks.
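    The lookup-table approach above can be sketched as follows; the table name lookup_table and task name csv_task are illustrative assumptions, and the chunk SQL simply returns each lookup ID as both start and end of a chunk:

```sql
-- Illustrative sketch (table and task names are assumptions).
create table lookup_table (
  id number primary key,
  lo varchar2(10),
  hi varchar2(10)
);

insert into lookup_table values (1, 'AAA', 'FZZ');
insert into lookup_table values (2, 'GAA', 'QZZ');
insert into lookup_table values (3, 'RAA', 'ZZZ');
commit;

begin
  dbms_parallel_execute.create_task(task_name => 'csv_task');

  -- Each row returned by the chunk SQL becomes one chunk, with
  -- start_id/end_id taken from the two selected columns.
  dbms_parallel_execute.create_chunks_by_sql(
    task_name => 'csv_task',
    sql_stmt  => 'select id as start_id, id as end_id from lookup_table',
    by_rowid  => false
  );
end;
/
```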

    Then the SQL or PL/SQL that does your work (ie, the routine that DBMS_PARALLEL_EXECUTE will schedule) will do

    select ...
    from   my_table t, lookup_table l
    where  t.field1 >= l.lo and t.field1 < l.hi
    and    l.id = :start_id
    order by t.field1
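    RUN_TASK is then called once for the whole task; the framework schedules the statement once per chunk and binds :start_id/:end_id from each chunk row. A minimal sketch, assuming a hypothetical worker procedure dump_chunk written along the lines of the compare procedure in the question:

```sql
-- dump_chunk is a hypothetical worker that writes one CSV per chunk,
-- selecting the FIELD1 range for lookup rows :start_id..:end_id.
begin
  dbms_parallel_execute.run_task(
    task_name      => 'csv_task',
    sql_stmt       => 'begin dump_chunk(:start_id, :end_id); end;',
    language_flag  => dbms_sql.native,
    parallel_level => 3
  );
end;
/
```

    Note that RUN_TASK appears exactly once, after all chunks exist, rather than inside a loop that creates one chunk at a time.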