Search code examples
rfixed-widthread.fwf

Reading fixed width format data into R with entries exceeding column width


I need to use the Annual Building Permits by Metropolitan Area Data distributed by the US Census Bureau, which are downloadable here as fixed width format text files. Here is an excerpt of the file (I've stripped the column names as they aren't in a nice format and can be replaced after reading the file into a date frame):

999 10180 Abilene, TX                             306     298       8       0       0       0
184 10420 Akron, OH                               909     905       0       4       0       0
999 13980 Blacksburg-Christiansburg-Radford,
  VA                                              543     455       0       4      84       3
145 14010 Bloomington, IL                         342     214       4       0     124       7
160 15380 Buffalo-Cheektowaga-Niagara Falls,*
  NY                                             1964     931      14      14    1005      68
268 15500 Burlington, NC                         1353     938      12      16     387      20

As seen in the above excerpt, many of the entries in the Name column exceed the width of the column (which looks to be 36 characters). I've experimented with the various fwf reading functions of both the utils package and readr but can't find a solution that takes these entries into account. Any tips would be much appreciated.


Edit: The original file excerpt was edited by a mod for formatting and in the process the example entries where the third column width was exceeded were deleted. I've since updated the excerpt to reinclude them and have stripped the column names.

I ran @markdly 's code, which was submitted before this edit, works for all the entries that don't have this issue. I exported the result to a csv, and included an excerpt below to show what happens with these entries:

"38","999",NA,"13980",NA,"Blacksburg-Christiansburg-Radford,",NA,NA,NA,NA,NA,NA
"39","V","A",NA,NA,NA,"543",455,0,4,84,3
"40","145",NA,"14010",NA,"Bloomington, IL","342",214,4,0,124,7
"51","160",NA,"15380",NA,"Buffalo-Cheektowaga-Niagara Falls,*",NA,NA,NA,NA,NA,NA
"52","N","Y",NA,NA,NA,"1964",931,14,14,1005,68
"53","268",NA,"15500",NA,"Burlington, NC","1353",938,12,16,387,20

Edit 2: Most of the major metro areas I'm actually looking at don't fall into this problem category, so while it would be nice to have the data for the ones that do, if there is no workable solution, would there be a way to remove these entries from the data set altogether?


Solution

  • Edit:
    Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.

    Step 1 Read the file in and assign the split records a common record id

    library(tidyverse)
    df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
      rownames_to_column() %>%
      mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))
    
    df[37:40, ]
    #> # A tibble: 4 x 8
    #>   rowname                                                    X1    X2
    #>     <chr>                                                 <chr> <int>
    #> 1      37 999 13900 Bismarck, ND                            856   629
    #> 2      38          999 13980 Blacksburg-Christiansburg-Radford,    NA
    #> 3      39   VA                                              543   455
    #> 4      40 145 14010 Bloomington, IL                         342   214
    #> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
    #> #   record <chr>
    

    Step 2 Combine the split record text then put the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.

    df <- df %>%
      mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
      extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
      mutate(Name = trimws(Name)) %>%
      filter((lead(record) != record) | rowname == 1) %>%
      select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)
    
    df[37:39, ]
    #> # A tibble: 3 x 9
    #>     CSA  CBSA                                 Name Total    X2    X3    X4
    #>   <chr> <chr>                                <chr> <chr> <int> <int> <int>
    #> 1   999 13900                         Bismarck, ND   856   629    16     6
    #> 2   999 13980 Blacksburg-Christiansburg-Radford,VA   543   455     0     4
    #> 3   145 14010                      Bloomington, IL   342   214     4     0
    #> # ... with 2 more variables: X5 <int>, X6 <int>
    

    Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.

    Example data

    library(readr)
    
    # example data
    txt <- "                                                                                        Num of
                                                                                            Struc-
                                                                                            tures
                                                                                            With
                                                                          3 and 4  5 Units  5 Units
    CSA CBSA  Name                                   Total 1 Unit 2 Units   Units  or more  or more
    
    999 10180 Abilene, TX                             306     298       8       0       0       0
    184 10420 Akron, OH                               909     905       0       4       0       0" 
    
    write_file(txt, "example.txt")
    

    Solution

    col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
    col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2", 
                   "units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
    df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
    df
    #> # A tibble: 2 x 11
    #>     CSA blank_1  CBSA blank_2        Name Total units_1 units_2
    #>   <int>   <chr> <int>   <chr>       <chr> <int>   <int>   <int>
    #> 1   999    <NA> 10180    <NA> Abilene, TX   306     298       8
    #> 2   184    <NA> 10420    <NA>   Akron, OH   909     905       0
    #> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
    #> #   num_struc_5_or_more <int>