Search code examples
pythonpandasnumpytextsas

Python output fixed width format text file with special lines as SAS do


I have the sample data as below:

# df
VAR1   SEQ    VAR2    VAR3       DATE    VAR4     VAR5    VAR6    VAR7
AAA      1     YYY      01   20000630      AL    11111    ABCD      PA
BBB      1     YYY      01   20100701      GA    12345    EDED      NY
BBB      2     YYY      01   20150815      GA    12345              NY
BBB      3     YYY      01   19950105      GA    12345    YTRU      NY
BBB      4     YYY      01   20000701      GA    12345    IIII      NY
BBB      5     YYY      01   20210701      GA    12345              NY
CCC      1     NNN      01   20210630      CA    33333    SSSS      NJ
CCC      2     NNN      01   20210629      CA    33333              NJ

In SAS, we can export fixed width format file as below:

BLANK_VAR1 = " "

%MACRO FRIST;
    PUT @  1  "00FIRST"
        @  8  VAR1       $CHAR5.
        @ 13  BLANK_VAR1 $CHAR2.
        @ 15  VAR2       $CHAR3.
    ;
%MEND FRIST;

%MACRO SECOND;
    PUT @  1  "00SECOND"
        @  9  VAR3       $CHAR2.
        @ 11  BLANK_VAR1 $CHAR2.
        @ 13  VAR4       $CHAR2.
        @ 15  VAR5       $CHAR5.
    ;
%MEND SECOND;

%MACRO THIRD(sequence);
    num = &sequence.;
    PUT @  1  num        Z2.0
        @  3  "THIRD"    $CHAR5.
        @  8  DATE       $CHAR8.
    ;
%MEND THIRD;

%MACRO FOURTH(sequence);
    num = &sequence.;
    PUT @  1  num        Z2.0
        @  3  "FOURTH"   $CHAR5.
        @  9  VAR6       $CHAR25.
        @ 34  BLANK_VAR1 $CHAR2.
        @ 36  VAR7       $CHAR2.
    ;
%MEND FOURTH;

filename outtmp "/home/folder/outfile_tmp";  

DATA _NULL_;                                                                    
   SET df;    
 
   BY VAR1 SEQ;                                                           
   FILE outtmp;  
                                                                
   IF FIRST.VAR1 THEN DO;
      %FRIST;
      %SECOND;
      REC_CNT = 0;                                                             
   END;                                                                       
   REC_CNT + 1;
   IF REC_CNT LE 3 THEN DO;
      %THIRD(REC_CNT);
      IF VAR6 NE ' ' THEN DO;
         %FOURTH(COUNTN);
      END;
   END;                                                         
RUN;


filename output "/home/folder/output"; 


%MACRO INREC;                                                                     
   PUT 001 RECIN $CHAR150.;
%MEND INREC;


%MACRO FILE_FIRST;
    DATE = TODAY();
    PUT @  1  "###FIRSTLINE###"
        @ 16  DATE       JULIAN5.
        @ 21  BLANK_VAR1 $CHAR2.
        @ 23  "###FIRSTLINEEND###"
    ;
%MEND FILE_FIRST;


%MACRO FILE_LAST;
    DATE = TODAY();
    PUT @  1  "###LASTLINE###"
        @ 15  DATE       JULIAN5.
        @ 20  BLANK_VAR1 $CHAR2.
        @ 22  "###LASTLINEEND###"
    ;
%MEND FILE_LAST;


DATA output;                                                                  
   INFILE outtmp truncover;                                                               
   INPUT                                                                       
      @ 001 RECIN $CHAR150.;                                                                         
RUN; 

DATA _NULL_;
   SET output  end=last;  
   file output  lrecl=256 ;
   IF _N_ = 1 THEN DO;
      %FILE_FIRST;
   END;

   %INREC;  
                                                                  
   IF last THEN DO;
      %FILE_LAST; 
   END;
RUN; 

This is the output:
enter image description here

###FIRSTLINE###21182  ###LASTLINEEND###
00FIRSTAAA  YYY
00SECOND01  AL11111
01THIRD20000630
01FOURTHABCD                       PA
00FIRSTBBB  YYY
00SECOND01  GA12345
01THIRD20100701
01FOURTHEDED                       NY
02THIRD20150815
03THIRD19950105
03FOURTHYTRU                       NY
00FIRSTCCC  NNN
00SECOND01  CA33333
01THIRD20210630
01FOURTHSSSS                       NJ
###LASTLINE###21182  ###LASTLINEEND###

The logic for above program is:

  1. There are four parts that needed to be output.
  2. If there are multiple same VAR1, then only output FIRST and SECOND once.
  3. Output THIRD part for SEQ is less than 3. If the SEQ is larger than 3, do not output. Ignore.
  4. Output the FOURTH part following the third logic and also if VAR6 is not missing.
  5. Note: In THIRD and FOURTH part, the first two string should change from 01 to 03 depends on the records.

How can I replicate this format in Python?
I found that np.savetxt() with fmt argument might be a way link; however, the file should be the same order as original dataframe.

pandas has function read_fwf() to read fixed width format file; however, no to_fwf() function to export.

I have been stuck for several days, so any idea should be helpful!


Solution

  • This isn't exactly a good way to do it, but maybe it gives you an idea how to do the logic. I'm just writing to a list, you can then write the list out - but probably you should do it the way JonSG did in his (deleted) answer where you use a file writer instead. There's probably a better approach using a data class, but that's not my expertise.

    import pandas as pd
    
    df = pd.read_csv(r"h:\temp\df_text.csv")
    
    outlist = []
    
    for index,row in df.iterrows():
        if(row['SEQ']==1):
            tempstr = '00FIRST'+row.VAR1+'  '+row.VAR2
            outlist.append(tempstr)
            tempstr = '00SECOND'+str(row.VAR3)+'  '+str(row.VAR4)+str(row.VAR5)
            outlist.append(tempstr)
        if(row['SEQ'] <= 3):
            seqval ='0'+str(row.SEQ) if  row.SEQ < 10 else str(row.SEQ)
            tempstr = str(row['SEQ'])+'THIRD'+str(row.DATE)
            outlist.append(tempstr)
            if (row.VAR6 != '  '):
                tempstr = str(row['SEQ'])+'FOURTH'+row.VAR6+'  '+row.VAR7
                outlist.append(tempstr)