Search code examples
rspecial-charactersreadrread-fwf

Problem when importing into R with fixed column widths


I am trying to import my data with fixed column widths. After the first special character appears, the number of characters per column changes. What could be the reason for this? File: https://gist.github.com/1902r/0b596431e15bde833e9d9a0640e12ba7

library(readr)
columns_widths <- c(3, 8, 7)
source_data <- "data.rpt"
source_data_raw <- read_fwf(source_data, fwf_widths(columns_widths))

I have tried to read in the data with fread(). But there are no arguments in support of fixed column widths. I have blanks in my string so this doesn't work.


Solution

  • The following code should solve your problem using utils::read.fwf() from base R. In the case when the (unquoted) character strings can contain blanks themselves (e.g. "Jo hn") you need this workaround. Otherwise data.table::fread() without any arguments should be fine.

    UTF-8-BOM encodings should not be used (if possible). See https://cran.r-project.org/doc/manuals/r-patched/R-data.pdf and https://en.wikipedia.org/wiki/Byte_order_mark

    In readr::read_fwf() column positions in fixed width files refer to bytes, not characters (see https://github.com/tidyverse/readr/issues/852). Therefore the row containing multibyte characters is displaced.

    library(readr)
    columns_widths <- c(3, 8, 7)
    source_data <- "https://gist.githubusercontent.com/1902r/0b596431e15bde833e9d9a0640e12ba7/raw/215e50d1db79b4a7aeb3560680a55bba8c9f1503/data.rpt"
    source_data_raw <- read_fwf(source_data, fwf_widths(columns_widths))
    #> Rows: 4 Columns: 3
    #> ── Column specification ────────────────────────────────────────────────────────
    #> 
    #> chr (3): X1, X2, X3
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    source_data_raw
    #> # A tibble: 4 × 3
    #>   X1    X2     X3    
    #>   <chr> <chr>  <chr> 
    #> 1 ID    Name   Amount
    #> 2 1     Jo hn  100   
    #> 3 2     Bílušk a 450 
    #> 4 3     Jane   200
    
    ## first read colnames from file
    datar_names <- read.fwf(source_data,
      widths = columns_widths,
      n = 1, fileEncoding = "UTF-8-BOM",
      strip.white = TRUE
    )
    
    ## read data using names from above
    datar <- read.fwf(source_data,
      widths = columns_widths,
      skip = 1, col.names = datar_names,
      fileEncoding = "UTF-8-BOM",
      strip.white = TRUE
    )
    datar
    #>   ID    Name Amount
    #> 1  1   Jo hn    100
    #> 2  2 Bíluška    450
    #> 3  3    Jane    200
    str(datar)
    #> 'data.frame':    3 obs. of  3 variables:
    #>  $ ID    : int  1 2 3
    #>  $ Name  : chr  "Jo hn" "Bíluška" "Jane"
    #>  $ Amount: int  100 450 200
    

    Created on 2024-04-26 with reprex v2.1.0