kdb

How to update a local table remotely?


I have a large table on a remote server with an unknown number of rows (millions). I'd like to fetch the data in batches of 100,000 rows at a time, updating my local table with each fetched batch, and repeat until all rows have been fetched. Is there a way I can update a local table remotely?

Currently I have a dummy table called t on the server along with the following variables...

t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)  / dummy table, 1 million rows
selector:select from t where sym like "A*"               / all rows matching the filter
counter:count selector                                   / number of matching rows
divy:counter%100000
divyUP:ceiling divy                                      / number of 100k batches needed

and the function below on the client, along with the variable index set to 0 and normTable, an empty copy of the remote table's schema...

index:0
normTable:h"0#t"    / empty copy of the remote table's schema

batches:{[idx;divy;anty;seltr]
    if[not idx=divy;
        batch:select[(anty;100000)] from seltr;  / take the next 100k-row slice
        `normTable upsert batch;
        idx+:1;
        anty+:100000;
        batches[idx;divy;anty;seltr]];
    idx::0}

I call that function using the following command...

 batches[index;h"divyUP";0;h"selector"]

The problem with this approach, though, is that h"selector" fetches all of the rows at once, and does so repeatedly, once for every batch of 100,000 rows it upserts to my local normTable.

I could move the batches function to the remote server, but then how would I update my local normTable remotely?

Alternatively, I could break the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...

batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"

Is there a way to generate the batchN variables programmatically, creating as many of them as divyUP requires?
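
For what it's worth, q can create numbered globals programmatically with set. A minimal sketch along those lines, reusing the server-side names above (note it produces batch0 through batch(divyUP-1) rather than starting at 1):

/ sketch: create one numbered global per 100k slice on the server
{(`$"batch",string x) set select[(x*100000;100000)] from t where sym like "A*"} each til divyUP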


Solution

  • I would suggest a few changes, since you are connecting to a remote server:

    • Do not run synchronous requests, as they force the server to block while serving you. Make asynchronous requests with callbacks instead.
    • Do not do a full table scan (an expensive comparison) on each call, especially for a regex match. Much of the data may still be in cache on the next call, but that is not guaranteed, and the scan will still disturb the server's normal operations.
    • Do not make data requests in bursts. Either use a timer (a sketch of that variant appears after the setup code below) or make the next data request only when the last batch has arrived.

    The approach below is based on these suggestions. It avoids a full-table regex scan by first bounding each batch on the virtual index column i (which is lightweight) and makes the next request only when the last batch has arrived.

    Create Batch processing function

    This function runs on the server. It reads a small batch of rows from the table, bounded by indices, and returns the required data.

    q) batch:{[ind;s] ni:ind+s;                                       / end index of this batch
              d:select from t where i within (ind;ni), sym like "A*"; / bound on i first, then regex
              neg[.z.w](`upd;d;$[ni<count t;ni+1;0])}                 / async callback to the client
    

    It takes two arguments: the starting index and the batch size. Note that within is inclusive on both ends, which is why the next batch starts at ni+1.

    This function finally calls the upd function on the local machine asynchronously, passing two arguments:

    • The data from the current batch request
    • The table index to start the next batch from (0 when all rows are done, to stop further batch processing)

    Create Callback function

    The result from the batch processing function arrives in this function.

    If the index is greater than 0, there is more data to process, and the next batch should start from this index.

    q) upd:{[data;ind] t::t,data; if[ind>0;fetch ind]}   / append the batch, then request the next one
    
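    If the local table were keyed, the comma join would need to be an upsert. A keyed-safe variant (same logic otherwise):

    q) upd:{[data;ind] `t upsert data; if[ind>0;fetch ind]}
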

    Create Main function to start process

    q) fetch:{[ind] neg[h](batch;ind;size)}   / async request, per the first suggestion; the batch lambda executes on the server
    

    Finally, open the connection, create the table variable, and run the fetch function.

    q) h:hopen `:server:port   / handle to the remote process
    q) t:()                    / local accumulator; becomes a table on the first append
    q) size:100                / rows per batch
    q) fetch 0                 / kick off the first batch
    
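    The third suggestion above mentioned a timer as an alternative to chaining the next request from inside the callback. A minimal sketch of that variant (the pend variable and the 200ms interval are illustrative assumptions, not part of the code above):

    q) pend:0N                                                      / next index to request; null = nothing pending
    q) upd:{[data;ind] t::t,data; $[ind>0;pend::ind;system"t 0"]}   / record next index; stop the timer when done
    q) .z.ts:{if[not null pend;fetch pend;pend::0N]}                / timer fires the next request
    q) system"t 200"                                                / poll every 200ms
    q) fetch 0
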

    Note that the above method assumes the server table is static. If it is being updated in real time, changes would be required, depending on how the table is updated on the server.

    Other optimizations may also be possible, depending on the attributes set on the remote table, which can further improve performance.
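
    For example, attributes speed up equality-based lookups (=, in) but not like, so if the "A*" filter could be enumerated as an exact symbol match, a grouped attribute on sym would let the server answer from an index rather than a scan. A sketch, assuming the pattern can be enumerated:

    q) update `g#sym from `t                                / one-off, on the server
    q) select from t where sym in `A`Ab, i within (ind;ni)  / in can use `g#; like cannot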