I have a excel file that each column is divided into sub columns . I tried to import it in following way
PROC IMPORT DATAFILE="path\IRCC_M_PRadmiss_0002_E (1).xls"
OUT=immigrants
DBMS=XLS REPLACE;
getnames=yes;
RUN;
But in this way i am getting column names as observations. I want to import it in more structured way so that columns can be accessed by their names.
Link to the excel file : https://files.fm/u/bapkwx7mk
The Excel file appears to contain a pivot table based on some other categorical data source. If you have access to that data source, import it instead.
Suppose the Excel file is a download from some government website and the original data is unavailable.
As you found out a single simple IMPORT will make all the columns character due to mixed types in the header cells and the values in the column headers pollute the flat import.
A pivot table has three parts
Proc IMPORT
can read ranges of cells from an Excel file and this gives rise to the following strategy for unpivoting the pivot table:
IMPORTING with GETNAMES=NO
will produce a raw starting point where each column from Excel will be named F<nn>
. When the raw data is transposed you will have a new column named _NAME_
whose values will be "F<nn>"
. The column names have just become data.
Looking across the cells in the data portion, each row has 98 columns that would become 98 rows when transposed. However, the totals columns can be discarded, and you will be left with 72 rows (in transpose) corresponding to 6 years * 12 months/year.
The original report can be reproduced using Proc TABULATE
once you have transformed the IMPORTs into categorical time.
Example:
The values of the _NAME_
variable are mapped to a corresponding month using a custom informat built from the column header transpose.
* read column headers;
PROC IMPORT DATAFILE="C:\Users\Richard\Downloads\IRCC_M_PRadmiss_0002_E (1).xls"
REPLACE OUT=headers_raw
;
RANGE="A3:CT5";
GETNAMES=NO;
RUN;
* read row header and data part (the counts);
PROC IMPORT DATAFILE="C:\Users\Richard\Downloads\IRCC_M_PRadmiss_0002_E (1).xls"
REPLACE OUT=data_raw
;
RANGE="A6:CT204";
GETNAMES=NO;
RUN;
proc transpose data=headers_raw
out=headers_tall(
drop=_label_
where=(col3 not contains 'Total' and col2 not contains 'Total')
);
var F2-F98;
run;
* construct special CNTLIN data set for custom informat named FIELD_TO_MONTH;
data headers_cntlin;
length start $32 label 8;
if _n_ = 1 then do;
hlo = 'O';
label = .;
output;
hlo = '';
end;
set headers_tall;
retain year;
if not missing (col1) then year = input(col1,4.);
retain fmtname 'FIELD_TO_MONTH' type 'i';
month = input (cats('01',col3,year), date9.);
start = _name_;
label = month;
output;
keep fmtname start label type hlo;
run;
* build the custom informat;
proc format cntlin=headers_cntlin;
run;
* F1 is the one-dimensional row header;
* F2-F98 are the data cells;
* after transpose the column names will be data in column _NAME_;
proc transpose data=data_raw
out=data_tall(
drop=_label_
rename=(F1=Country col1=count)
index=(_name_)
);
by F1 notsorted;
var F2-F98;
run;
data immigration;
set data_tall;
label Country=' ';
* map field name to month per column headers;
month = input(_name_,FIELD_TO_MONTH.);
if not missing(month);
* variables for wider temporal aggregations in time hierarchy year/qtr/month;
year = year(month);
qtr = qtr(month); format qtr NLSTRQTR2.1;
format month monname3.;
keep country year qtr month count;
run;
* reproduce the Excel pivot table in SAS;
ods html file='immigration.html' style=plateau;
proc tabulate data=immigration;
class year qtr month / order=internal;
class country;
var count;
table
country
,
( year=''
* ( qtr=''
* ( month=''
all
)
all
)
all
)
* count='' * sum=' ' * f=comma12.
;
run;
ods html close;
Headers raw (from range)
Data raw (from range)
Headers tall
Headers CNTLIN for custom informat FIELD_TO_MONTH
Data tall (72 rows per country for 6 years and 12 months / year)
Immigration after applying informat to field names (F2....) to compute variables for temporal hierarchy
Original table reproduced by TABULATE