Tags: sql-server, sas, processing-efficiency, memory-efficient

The fastest way to copy a huge dataset to a server in SAS


Edited on Oct 21st

Background

Raw datasets (which change every day) are stored on an MS-SQL-based server as Sourcelib.RAW; there is also a LOCAL Excel file (which remains unchanged).

I need to refresh a dataset WANT located in Targetlib. Currently I have SQL code that performs this task in 2-3 minutes, but I want to know whether SAS can do the same thing without the processing time increasing much.

  1. work.IMP is around 6M records, at around 50 bytes per record.
  2. The ideal method should be very efficient, because as time goes by the raw datasets on the server will become incredibly huge.
  3. The target file CANNOT be established once and then have new data appended to it every day, because there may be (even if very unlikely) changes to previous data.

As per @Joe, I should allow the target file to just be updated, using proc compare or an update in a data step. Here's a related question I posted: How to use proc compare to update dataset

  1. There is still more than 10GB of free space on the server, which is enough. Available memory on my PC is around 3.5GB (not sure if it matters).
  2. Due to the architecture of the server, it's very efficient to do this in MS-SQL. But I REALLY want to know whether SAS can deal with this (when the server is not so "compatible").

Process

  1. First I import the data from the Excel file and then subset and transpose it into work.IMP. For some reasons, this file can only be created this way every day. It CANNOT be stored on the server.
  2. Then I perform an outer join of work.IMP and one raw dataset, Sourcelib.RAW1, to get work.HAVE1. Please note that work.IMP is sorted but Sourcelib.RAW1 is unsorted. The outer join only serves to determine (with some criteria) the value of each record.

i.e. case when a.COL1 is missing then b.COL1 else a.COL1 end as COL1

You can think of this process as adjusting Sourcelib.RAW1 using work.IMP.
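
A minimal PROC SQL sketch of this step, assuming a single key variable ID and one adjusted column COL1 (both placeholders, not the actual names):

    proc sql;
        create table work.HAVE1 as
        select
            coalesce(a.ID, b.ID) as ID,
            /* prefer the local (adjusted) value, fall back to the server value */
            case when a.COL1 is missing then b.COL1 else a.COL1 end as COL1
        from work.IMP as a
        full join Sourcelib.RAW1 as b
            on a.ID = b.ID;
    quit;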

PS1: @sparc_spread suggests doing the import procedure directly on the server, but it has no benefit over doing it LOCALLY. A hash object doesn't help here either.

  3. Then I subset another raw dataset, Sourcelib.RAW2, into work.temp, and sort it into work.HAVE2. (The data in Sourcelib.RAW2 is mostly not in order.)
  4. I concatenate work.HAVE1 and work.HAVE2 into work.HAVE using proc append (because both tables are huge), as sketched after PS2 below.

PS2: The sorting in step 3 is to avoid sorting at the end of step 4. Actually the data in Targetlib.WANT doesn't have to be in order, but it's better if it is.
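
A rough sketch of steps 3 and 4 as described, with the subsetting criteria, key variable, and intermediate names taken as placeholders:

    /* Step 3: subset RAW2 locally, then sort the subset */
    data work.temp;
        set Sourcelib.RAW2;
        where some_condition;        /* placeholder for the actual criteria */
    run;

    proc sort data=work.temp out=work.HAVE2;
        by ID;                       /* placeholder key */
    run;

    /* Step 4: build work.HAVE by appending both pieces
       (PROC APPEND creates the base table on the first call) */
    proc append base=work.HAVE data=work.HAVE1;
    run;
    proc append base=work.HAVE data=work.HAVE2;
    run;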

  5. At the very end, I copy work.HAVE to the server as Targetlib.HAVE.

I do most of the work in WORK, which takes only a few minutes, but step 5 can take half an hour to finish the copy.
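
For reference, step 5 presumably boils down to something like the following (a straight copy of the finished WORK table to the server library):

    /* Current step 5: push the finished table to the server */
    data Targetlib.HAVE;
        set work.HAVE;
    run;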

As per @Joe, this is probably mainly due to something related to network transit, i.e. I should minimize the network transit.

Question

Is there any way to improve step 5? Or is there any modification of the whole process that would improve the performance?


Solution

  • Couple of thoughts.

    First off, assuming this is a SAS dataset and not a SQL database or something else, options compress=binary; is a good idea if the data is mostly numeric (and options compress=character; if not). Either will reduce the physical size of the dataset significantly in most cases.
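
    For example (COMPRESS can be set for the whole session or per dataset; the dataset names are the ones from the question):

        /* Session-wide compression; BINARY (RDC) usually suits numeric-heavy data */
        options compress=binary;

        /* Or per dataset, only where it matters */
        data Targetlib.HAVE (compress=binary);
            set work.HAVE;
        run;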

    Second, 300MB is not very much in the scheme of things. My network would write that in less than a minute. The condition of your network may drive some of the other choices you make; if the single slow point is simply copying data across it, for example, then you need to figure out how to reduce that, even at the cost of anything else you do.

    Assuming you don't change anything else, I would recommend writing have1 directly to the network as have, and then appending have2 to it. That is, whatever step creates have1, have it write directly to the network. This includes sort steps, note: so if you create it and then sort it, create it locally and sort it with OUT= pointing to the network library. This reduces the total amount of writing done (since you don't write a useless copy of have1 to your local drive). It helps if writing locally is a relevant cost in your total process, but won't help if it's almost entirely network congestion.
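
    A sketch of that arrangement, reusing the library names from the question (ID is a placeholder sort key):

        /* Let the sort write its output straight to the server library,
           so the sorted HAVE1 is only ever written once, on the server side */
        proc sort data=work.HAVE1 out=Targetlib.HAVE;
            by ID;
        run;

        /* Append the second piece to the copy already on the server */
        proc append base=Targetlib.HAVE data=work.HAVE2;
        run;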

    Copying files with the OS's copy is almost always superior to any other method of copying, so if network congestion is the only factor you care about, you may want to create the file locally (in WORK, or in a local but constant directory such as C:\temp\) and then have the last step of your process execute copy c:\temp\have.sas7bdat \\networklocation\whereitgoes\. This will usually outperform SAS methods for the same task, as it can take advantage of efficient copying techniques.
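
    On Windows, that last step could be scripted from inside the SAS program itself; a sketch, reusing the illustrative paths above:

        /* Don't open an interactive command window, and wait for the copy to finish */
        options noxwait xsync;

        /* Hand the actual transfer to the operating system */
        x 'copy /Y "C:\temp\have.sas7bdat" "\\networklocation\whereitgoes\have.sas7bdat"';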

    PROC COPY is another option to get around network congestion; it's probably faster than PROC APPEND (if the local write-out is negligible, as it would be for me for <1 GB data), and has the advantage that it's a bit safer in case something happens during the transport. (Append should be fine also, but with COPY you know for sure nothing was changed from yesterday's file.)
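
    For example:

        /* Copy the finished dataset from WORK to the server library in one step */
        proc copy in=work out=Targetlib memtype=data;
            select HAVE;
        run;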

    Finally, you may want to figure out some way to allow the target file to just be updated. This isn't all that hard to do in most cases. One example would be to keep a copy of yesterday's file, do a PROC COMPARE to today's file, and then include in the update file every record that is changed (regardless of what the change is). Then delete any matching old records from the target file and append the new records. This is very fast comparatively in terms of total records sent over the network, so saves a lot of time overall if network congestion is the main issue (but takes more CPU time to do the PROC COMPARE).
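
    A rough sketch of that pattern, assuming a single key variable ID, a saved copy of yesterday's data in a local library called prev, and Targetlib.HAVE as the target (all placeholders; records deleted from the source are not handled here):

        /* Both inputs must be sorted by the ID variable for PROC COMPARE's ID statement */
        proc sort data=work.HAVE      out=work.today;     by ID; run;
        proc sort data=prev.yesterday out=work.yesterday; by ID; run;

        /* Keep only the observations whose values actually changed */
        proc compare base=work.yesterday compare=work.today
                     out=work.changed outnoequal noprint;
            id ID;
        run;

        proc sql;
            /* Resend changed records plus records that are brand new today */
            create table work.to_send as
            select * from work.today
            where ID in (select ID from work.changed)
               or ID not in (select ID from work.yesterday);

            /* Drop the stale versions from the target table */
            delete from Targetlib.HAVE
            where ID in (select ID from work.to_send);
        quit;

        /* Ship only the delta across the network */
        proc append base=Targetlib.HAVE data=work.to_send;
        run;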