Search code examples
excelimportsasxlstabular

How to import excel file when each column divides into sub columns in SAS


I have a excel file that each column is divided into sub columns . I tried to import it in following way

PROC IMPORT DATAFILE="path\IRCC_M_PRadmiss_0002_E (1).xls"
     OUT=immigrants
     DBMS=XLS REPLACE;
     getnames=yes;
    
RUN;

Image of the dataset after import But in this way i am getting column names as observations. I want to import it in more structured way so that columns can be accessed by their names. Link to the excel file : https://files.fm/u/bapkwx7mk


Solution

  • The Excel file appears to contain a pivot table based on some other categorical data source. If you have access to that data source, import it instead.

    Suppose the Excel file is a download from some government website and the original data is unavailable.

    As you found out a single simple IMPORT will make all the columns character due to mixed types in the header cells and the values in the column headers pollute the flat import.

    A pivot table has three parts

    • The column headers (A crossing of class variables)
      • The table in your file crosses three variables
        Year, Quarter, and Month
    • The row headers (A crossing of class variables)
      • The table in your file crosses one variable
        Country
    • The data portion (the cells where the dimensions intersect)
      • Your table contains count totals

    Proc IMPORT can read ranges of cells from an Excel file and this gives rise to the following strategy for unpivoting the pivot table:

    • IMPORT the column header part,
    • IMPORT the row header and data portion together. This can be done because the row header is one dimensional.
    • TRANSPOSE the column header part and the row header + data part and merge into a final categorical form.
      • The merging will associate a categorical (or class) values (country, year, quarter, and month) to each data cell.

    IMPORTING with GETNAMES=NO will produce a raw starting point where each column from Excel will be named F<nn>. When the raw data is transposed you will have a new column named _NAME_ whose values will be "F<nn>". The column names have just become data.

    Looking across the cells in the data portion, each row has 98 columns that would become 98 rows when transposed. However, the totals columns can be discarded, and you will be left with 72 rows (in transpose) corresponding to 6 years * 12 months/year.

    The original report can be reproduced using Proc TABULATE once you have transformed the IMPORTs into categorical time.

    Example:

    The values of the _NAME_ variable are mapped to a corresponding month using a custom informat built from the column header transpose.

    * read column headers;
    
    PROC IMPORT DATAFILE="C:\Users\Richard\Downloads\IRCC_M_PRadmiss_0002_E (1).xls"
      REPLACE OUT=headers_raw
      ;
      RANGE="A3:CT5";
      GETNAMES=NO;
    RUN;
    
    * read row header and data part (the counts);
    
    PROC IMPORT DATAFILE="C:\Users\Richard\Downloads\IRCC_M_PRadmiss_0002_E (1).xls"
      REPLACE OUT=data_raw
      ;
      RANGE="A6:CT204";
      GETNAMES=NO;
    RUN;
    
    proc transpose data=headers_raw 
      out=headers_tall(
        drop=_label_
        where=(col3 not contains 'Total' and col2 not contains 'Total')
      );
      var F2-F98;
    run;
    
    * construct special CNTLIN data set for custom informat named FIELD_TO_MONTH;
    
    data headers_cntlin;
      length start $32 label 8;
    
      if _n_ = 1 then do;
        hlo = 'O';
        label = .;
        output;
        hlo = '';
      end;
    
      set headers_tall;
      retain year;
      if not missing (col1) then year = input(col1,4.);
    
      retain fmtname 'FIELD_TO_MONTH' type 'i';
    
      month = input (cats('01',col3,year), date9.);
    
      start = _name_;
      label = month;
      output;
    
      keep fmtname start label type hlo;
    run;
    
    * build the custom informat;
    
    proc format cntlin=headers_cntlin;
    run;
    
    * F1 is the one-dimensional row header;
    * F2-F98 are the data cells;
    * after transpose the column names will be data in column _NAME_;
    
    proc transpose data=data_raw
      out=data_tall(
        drop=_label_ 
        rename=(F1=Country col1=count)
        index=(_name_)
      );
      by F1 notsorted;
      var F2-F98;
    run;
    
    data immigration;
      set data_tall;
      label Country=' ';
    
      * map field name to month per column headers;
    
      month = input(_name_,FIELD_TO_MONTH.);
      if not missing(month);
    
      * variables for wider temporal aggregations in time hierarchy year/qtr/month;
    
      year = year(month);  
      qtr  = qtr(month);   format qtr NLSTRQTR2.1;
    
      format month monname3.;
    
      keep country year qtr month count;
    run;
    
    * reproduce the Excel pivot table in SAS;
    
    ods html file='immigration.html' style=plateau;
    
    proc tabulate data=immigration;
      class year qtr month / order=internal;
      class country;
      var count;
    
      table 
        country
        ,
        ( year=''
        * ( qtr=''
          * ( month='' 
              all
            )
            all
           )
           all
        )
        * count='' * sum=' ' * f=comma12.
        ;
    run;
    
    ods html close;
    

    Headers raw (from range)

    enter image description here

    Data raw (from range)

    enter image description here

    Headers tall

    enter image description here

    Headers CNTLIN for custom informat FIELD_TO_MONTH

    enter image description here

    Data tall (72 rows per country for 6 years and 12 months / year)

    enter image description here

    Immigration after applying informat to field names (F2....) to compute variables for temporal hierarchy

    enter image description here

    Original table reproduced by TABULATE

    enter image description here