What is the best way to integrate SAS with Hadoop without losing the parallel processing capacity of Hadoop?

I am trying to understand the integration between SAS and Hadoop. From what I understand, SAS procedures such as PROC SQL can only work against a SAS data set, so I cannot run PROC SQL against a text file on a Hadoop node. Is that correct?

If yes, then I need to use some ETL jobs to first take the data out of HDFS and convert it to SAS tables. But if I do that, I will lose the parallel processing capabilities of Hadoop, am I right?

So what is the ideal way of integrating SAS and Hadoop and still use the parallel processing power of Hadoop?

I understand you can call a MapReduce job from inside SAS, but can a MapReduce job be written in SAS? I think not.

Solution

One of the major pushes at SAS Global Forum 2015 was the new options for connecting to Hadoop and Teradata. FedSQL and DS2, new in SAS 9.4, exist in part specifically to let SAS work better with Hadoop. You can execute code directly on your Hadoop nodes, and you can also do much more efficient processing in SAS itself.
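As a rough illustration of pushing work to the cluster instead of extracting the data first, here is a minimal sketch using explicit SQL pass-through. It assumes SAS/ACCESS Interface to Hadoop is licensed and a Hive server is reachable; the host name, port, credentials, and the `sales` table with its `region` column are all hypothetical placeholders, not anything from your environment.

```sas
/* Sketch only: server, port, schema, user, password, and the  */
/* sales/region names are hypothetical placeholders.           */
proc sql;
  /* Open a connection to Hive through SAS/ACCESS to Hadoop */
  connect to hadoop (server="hive.example.com" port=10000
                     schema=default user=myuser password="XXXX");

  /* The inner query is passed to Hive as written, so the     */
  /* aggregation runs on the cluster; only the summary rows   */
  /* are returned to the SAS session.                         */
  create table work.region_counts as
  select * from connection to hadoop
    ( select region, count(*) as n
      from sales
      group by region );

  disconnect from hadoop;
quit;
```

With this pattern the heavy lifting stays in Hadoop and only the small result table lands in the WORK library, which is the opposite of pulling the raw data out of HDFS into SAS tables before querying it.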

Assuming you have the most recent release of SAS (9.4 TS1M3), you can look at the SAS Release Notes (Current as of 9/3/2015; in the future this will point to later versions). That includes information like the following:

In the second maintenance release for SAS 9.4, the SAS In-Database Code Accelerator for Hadoop runs the DS2 data program as well as the thread program inside the database. Several new functions have been added. The HTTP package enables you to construct an HTTP client to access web services and a new logger enables logging of HTTP traffic. A connection string parameter is available when instantiating an SQLSTMT package.

SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard. It provides support for new data types and other ANSI 1999 core compliance features and proprietary extensions. FedSQL provides data access technology that brings a scalable, threaded, high-performance way to access, manage, and share relational data in multiple data sources. FedSQL is a vendor-neutral SQL dialect that accesses data from various data sources without submitting queries in the SQL dialect that is specific to the data source. In addition, a single FedSQL query can target data in several data sources and return a single result table. The FEDSQL procedure enables you to submit FedSQL language statements from a Base SAS session. The first maintenance release for SAS 9.4 adds support for Memory Data Store (MDS), SAP HANA, and SASHDAT data sources.

In the second maintenance release for SAS 9.4, SAS FedSQL supports Hive, HDMD, and PostgreSQL data sources. Data types can be converted to another data type. You can add DBMS-specific clauses to the end of the CREATE INDEX statement, and you can write a SASHDAT file in compressed format.

In the third maintenance release of SAS 9.4, FedSQL has added support for HAWQ and Impala distributions of Hadoop, enhanced support for Impala, new data types, and more.

Hadoop Support

The first maintenance release for SAS 9.4 enables you to use the SPD Engine to read, write, and update data in a Hadoop cluster through the HDFS. In addition, you can now use the HADOOP procedure to submit configuration properties to the Hadoop server.

In the second maintenance release for SAS 9.4, performance has been improved for the SPD Engine access to Hadoop. The SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS is available from the support.sas.com third-party site for Hadoop.

In the third maintenance release of SAS 9.4, access to data stored in HDFS is enhanced with a new distributed lock manager and therefore easier access to Hadoop clusters using Hadoop configuration files.

Beyond this, there are extensive documentation and papers on the subject, such as the documentation for the SAS Connector for Hadoop.