Search code examples
rsas

How to open a text (.txt) file from SAS in R?


I have a text (.txt) file that apparently was created on SAS. I have no idea how SAS works, but I need to open the text file in R so it reads as a dataframe with variables names and all.

The file can be downloaded here: https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=pense/2009/microdados/ . It's a zip file called "PENSE_2009_microdados.zip". There are 3 folders:

  1. Dados - with "Base_Pense Completa_2009.txt"
  2. Documentação - with the data dictionary
  3. Input SAS - with a .csv file that I could not understand
  4. A "read me" file that explains that the "Dados" folder contains the data and the "Documentação" folder contains "SAS's layout and reading input".

Does anybody know how to "transform" this data into a dataframe that makes sense? Thanks in advance.


Solution

  • It's a fixed-width file, each line contains all variables in the documentation without spacing, and you provide the number of characters that make up each column.

    The input sas file (this should be a .sas, not sure why it would be excel format) tells you the variable names and their formats/widths.

    You can also get this from the documentation excel file plus the format of factor variables, eg, B01P02 has 5 unique values coded 1 thru 5.

    doc <- readxl::read_excel('PENSE_2009_microdados/Documenta‡ֶo/Dicionario_Base Pense_Completa.xls')
    ## the widths for each column
    wid <- doc$TÉRMINO - doc$INICIO + 1
    
    dat <- read.fwf(
      'PENSE_2009_microdados/Dados/Base_Pense Completa_2009.txt',
      ## also add variable names
      widths = wid, col.names = doc$CAMPO,
      # n = 1000,
      ## some options for importing NAs/trimming white space
      na.strings = c('NA', '.', ''), strip.white = TRUE
    )
    
    ## add labels if wanted
    dat <- Hmisc::upData(dat, labels = setNames(doc$DESCRICAO, doc$CAMPO))
    
    str(dat)
    

    Here is how you can use the formats. They are not standardized, so not all will be fixed this way:

    x <- doc$DOMÍNIO[doc$CAMPO %in% 'B01P02']
    # [1] "1- Branca\n 2- Preta(Negra)\n 3- Parda(Mulata) \n 4- Amarela(Oriental) \n 5- Indígena"
    
    fmt <- read.table(text = x, sep = '-', strip.white = TRUE)
    x_fmt <- factor(dat$B01P02, fmt$V1, fmt$V2)
    
    table(old = dat$B01P02, new = x_fmt)
    #    new
    # old Branca Preta(Negra) Parda(Mulata) Amarela(Oriental) Indígena
    #   1    309            0             0                 0        0
    #   2      0          167             0                 0        0
    #   3      0            0           436                 0        0
    #   4      0            0             0                27        0
    #   5      0            0             0                 0       30