Background
Raw datasets (changed every day) are stored on an MS-SQL based server, Sourcelib.RAW, along with a LOCAL Excel file (which remains unchanged).
I need to refresh a dataset WANT located in Targetlib. Currently I have SQL code that performs this task in 2-3 minutes, but I want to know whether SAS can do the same thing without the processing time increasing much.
work.IMP is around 6M records at around 50 bytes per record. It CANNOT be established once and then have new data appended to it every day, because there may be (even if very unlikely) changes in the previous data. As per @Joe, I should allow the target file to just be updated, using proc compare, or update in a data step. Here's a related question I posted: How to use proc compare to update dataset.
"compatible"
)Process
1. Import the Excel file, then subset & transpose it to be work.IMP. For some reasons, this file can only be created in this way every day. It CANNOT be stored on the server.
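Purely as an illustration of step 1, here is a sketch with a placeholder file path, subset condition, and key variable (and assuming SAS/ACCESS to PC Files is available for the xlsx import):

```sas
proc import datafile="c:\local\source.xlsx" out=work.raw_imp
            dbms=xlsx replace;
run;

data work.subset;
   set work.raw_imp;
   where keepflag = 1;   /* hypothetical subset condition */
run;

proc sort data=work.subset;
   by ID;                /* hypothetical key; leaves work.IMP sorted */
run;

proc transpose data=work.subset out=work.IMP (drop=_name_);
   by ID;
run;
```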
2. Outer join work.IMP and one raw dataset, Sourcelib.RAW1, to get work.HAVE1. Please note that work.IMP is sorted but Sourcelib.RAW1 is unsorted. The outer join is only to (with some criteria) determine each data record, i.e. case when a.COL1 is '' then b.COL1 else a.COL1 end as COL1. You can consider this process as adjusting Sourcelib.RAW1 by using work.IMP.
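For illustration only, a minimal sketch of what this join could look like in PROC SQL; the key variable KEY is a placeholder, only the COL1 adjustment from above is shown, and note that PROC SQL wants = '' rather than is '' for the blank test:

```sas
proc sql;
   create table work.HAVE1 as
   select case when a.COL1 = '' then b.COL1 else a.COL1 end as COL1,
          /* ...repeat the pattern for the other adjusted columns... */
          coalesce(a.KEY, b.KEY) as KEY   /* coalescec for a character key */
   from work.IMP as a
   full join Sourcelib.RAW1 as b
     on a.KEY = b.KEY;
quit;
```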
PS1: @sparc_spread suggests doing the import procedure directly on the server, but that wouldn't have any benefit over doing it in LOCAL. And a hash object doesn't help here either.
3. Subset another raw dataset, Sourcelib.RAW2, to work.temp, and then sort it to be work.HAVE2. (The data in Sourcelib.RAW2 is mostly not in order.)
4. Concatenate work.HAVE1 and work.HAVE2 by using proc append (because both tables are huge) to be work.HAVE. A sketch of steps 3-4 follows.
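A minimal sketch of steps 3-4, with placeholder subset criteria and sort key (the real conditions are whatever the current code uses):

```sas
/* Step 3: subset, then sort (hypothetical condition and key) */
data work.temp;
   set Sourcelib.RAW2;
   where somevar = 'keep';
run;

proc sort data=work.temp out=work.HAVE2;
   by KEY;
run;

/* Step 4: proc append creates the base table if it doesn't exist,
   so two appends concatenate without rewriting either input */
proc append base=work.HAVE data=work.HAVE1;
run;
proc append base=work.HAVE data=work.HAVE2;
run;
```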
PS2: The sorting in step 3 is to avoid sorting at the end of step 4. Actually the data in Targetlib.WANT doesn't have to be in order, but it's better if it is.
5. Copy work.HAVE to the server, as Targetlib.HAVE. I did most of the work in WORK, which only took a few minutes, but step 5 could take half an hour to finish the copy.
As per @Joe, this may be mainly due to network transit; i.e., I should minimize the network transit.
Question
Is there any way to improve step 5? Or is there any modification of the whole process that would improve the performance?
Couple of thoughts.
First off, assuming this is a SAS dataset and not a SQL database or something else, options compress=binary; is a good idea if the data is mostly numeric (and options compress=character; if not). Either will significantly reduce the physical size of the dataset in most cases.
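For example (the dataset names here are placeholders), compression can be turned on session-wide or per dataset:

```sas
/* Session-wide: */
options compress=binary;   /* or compress=character for mostly-character data */

/* Or per dataset, as a dataset option: */
data targetlib.have (compress=binary);
   set work.have;
run;
```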
Second, 300MB is not very much in the scheme of things. My network would write that in less than a minute. What the conditions of your network are may drive some of the other choices you make; if the single slow point is simply copying data across it, for example, then you need to figure out how to reduce that at the cost of anything else you do.
Assuming you don't change anything else, I would recommend writing have1 directly to the network as have, and then appending have2 to it. I.e., whatever step creates have1, have it write directly to the network. Note that this includes sort steps: if you create the table and then sort it, create it locally but sort it with out= pointing to the network library. This reduces the total amount of writing done (as you don't write a useless copy of have1 to your local drive). It helps if writing locally is a relevant cost in your total process, but won't help if the problem is almost entirely network congestion.
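A minimal sketch of that pattern, assuming targetlib is already assigned to the network location and KEY again stands in for the real sort variables:

```sas
/* Sort the locally-created table straight out to the network library */
proc sort data=work.HAVE1 out=targetlib.HAVE;
   by KEY;
run;

/* Then append the second chunk directly to the network copy */
proc append base=targetlib.HAVE data=work.HAVE2;
run;
```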
Copying files with the OS's copy is almost always superior to any other method of copying, so if network congestion is the only factor you care about, you may want to create the file locally (in WORK, or in a local but constant directory like C:\temp\ or similar) and then have the last step of your process execute copy c:\temp\have.sas7bdat \\networklocation\whereitgoes\. This will usually outperform SAS methods for the same task, as the OS copy can take advantage of more efficient techniques.
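From inside the program, that final step could be a shell-out like the following, assuming the session allows operating-system commands (XCMD enabled; noxwait is a Windows option that keeps the command window from lingering):

```sas
options noxwait;
x 'copy c:\temp\have.sas7bdat \\networklocation\whereitgoes\';
```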
PROC COPY is another option for getting around network congestion; it's probably faster than PROC APPEND (if the local write-out is negligible, as it would be for me for <1 GB of data), and it has the advantage of being a bit safer in case something happens during the transport. (Append should be fine too, but with COPY you know for sure that nothing was changed from yesterday's file.)
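A minimal PROC COPY for this step, assuming the finished table sits in WORK and targetlib points at the server:

```sas
proc copy in=work out=targetlib memtype=data;
   select have;   /* copy only the one table */
run;
```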
Finally, you may want to figure out some way to allow the target file to just be updated. This isn't all that hard to do in most cases. One example would be to keep a copy of yesterday's file, do a PROC COMPARE against today's file, and then include in the update file every record that changed (regardless of what the change is). Then delete the matching old records from the target file and append the new records. This is comparatively very fast in terms of total records sent over the network, so it saves a lot of time overall if network congestion is the main issue (but it takes more CPU time to do the PROC COMPARE).
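A rough sketch of that delta-update idea, assuming a single (hypothetical) key variable KEY identifies records and a libname OLD holds yesterday's copy; records that exist only in today's file would need separate handling:

```sas
/* 1. Keep only the observations judged unequal, with today's values */
proc compare base=old.have compare=work.have
             out=work.changed outnoequal outcomp noprint;
   id KEY;
run;

data work.changed;
   set work.changed (drop=_type_ _obs_);   /* drop PROC COMPARE bookkeeping vars */
run;

/* 2. Delete the stale versions of those records from the target... */
proc sql;
   delete from targetlib.want
   where KEY in (select KEY from work.changed);
quit;

/* 3. ...and send only the changed records across the network */
proc append base=targetlib.want data=work.changed;
run;
```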