I am trying to import my data with fixed column widths. After the first special character appears, the number of characters per column changes. What could be the reason for this? File: https://gist.github.com/1902r/0b596431e15bde833e9d9a0640e12ba7
library(readr)
columns_widths <- c(3, 8, 7)
source_data <- "data.rpt"
source_data_raw <- read_fwf(source_data, fwf_widths(columns_widths))
I have tried to read in the data with fread(). But there are no arguments in support of fixed column widths. I have blanks in my string so this doesn't work.
The following code should solve your problem using utils::read.fwf()
from base R. In the case when the (unquoted) character strings can contain blanks themselves (e.g. "Jo hn") you need this workaround. Otherwise data.table::fread()
without any arguments should be fine.
UTF-8-BOM encodings should not be used (if possible). See https://cran.r-project.org/doc/manuals/r-patched/R-data.pdf and https://en.wikipedia.org/wiki/Byte_order_mark
In readr::read_fwf()
column positions in fixed width files refer to bytes, not characters (see https://github.com/tidyverse/readr/issues/852). Therefore the row containing multibyte characters is displaced.
library(readr)
columns_widths <- c(3, 8, 7)
source_data <- "https://gist.githubusercontent.com/1902r/0b596431e15bde833e9d9a0640e12ba7/raw/215e50d1db79b4a7aeb3560680a55bba8c9f1503/data.rpt"
source_data_raw <- read_fwf(source_data, fwf_widths(columns_widths))
#> Rows: 4 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#>
#> chr (3): X1, X2, X3
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
source_data_raw
#> # A tibble: 4 × 3
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 ID Name Amount
#> 2 1 Jo hn 100
#> 3 2 Bílušk a 450
#> 4 3 Jane 200
## first read colnames from file
datar_names <- read.fwf(source_data,
widths = columns_widths,
n = 1, fileEncoding = "UTF-8-BOM",
strip.white = TRUE
)
## read data using names from above
datar <- read.fwf(source_data,
widths = columns_widths,
skip = 1, col.names = datar_names,
fileEncoding = "UTF-8-BOM",
strip.white = TRUE
)
datar
#> ID Name Amount
#> 1 1 Jo hn 100
#> 2 2 Bíluška 450
#> 3 3 Jane 200
str(datar)
#> 'data.frame': 3 obs. of 3 variables:
#> $ ID : int 1 2 3
#> $ Name : chr "Jo hn" "Bíluška" "Jane"
#> $ Amount: int 100 450 200
Created on 2024-04-26 with reprex v2.1.0