I'm trying to use R to extract the data table - and a couple of other bits of information - from the historical Met Office data, but despite having spent all evening on StackOverflow I keep running into problems.
For example, here's the data for sunny (maybe??) Lowestoft:
Lowestoft / Lowestoft Monckton Ave from Sept 2007
Location 654300E 294600N 25m amsl to July 2007
& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
   yyyy  mm   tmax   tmin     af   rain    sun
              degC   degC   days     mm  hours
   1914   1    5.2    0.7    ---   52.0    ---
   1914   2    9.2    3.5    ---   28.0    ---
   1914   3    ---    ---    ---    ---    ---
   1914   4   12.9    5.3    ---   18.0    ---
   ...
   2020  11  12.5*   6.1*     0*  31.9*  73.7*  Provisional
   2020  12   7.7*   2.9*     6* 105.8*  50.5*  Provisional
   2021   1   5.8*   1.2*    10*  78.6*  49.4*  Provisional
   2021   2   7.9*   2.4*     9*  48.6*  84.7*  Provisional
The best I've managed so far is to use sed (outside of R) to remove the *'d and #'d values, but importing the result using read.table("lowestoftdata.txt", skip = 8, col.names = c("year","month","max_temp", "min_temp", "frost", "rainfall", "sunshine")) falls over when it hits the rows from 2020 onwards flagged as Provisional. It would also be really handy to extract the latitude and longitude values, which are usually on line 2 but can be on line 3 if, like Lowestoft, the station moved at some point, but my very limited regex knowledge (and a moving target) is letting me down.
My pseudocode approach would be to:
... but putting this into practice is proving challenging with my limited experience of handling anything other than a nicely formatted CSV file, so any advice on even where to start would be greatly appreciated.
Here is another go:
A bunch of different things were necessary to clean this up.
First of all, handling the two-line header (those are always a pain). There might be simpler solutions for this, but at some point you just need to get the job done.
I merged the two lines into one and used those slightly longer texts as headers.
The cleanup step before the data read is somewhat cryptic, but it strips anything from the end of each line that isn't a digit, a dash or a star. (That trims the trailing text comments, which otherwise mess up the field parsing with fread, which is blazingly fast.)
library(data.table)
library(purrr)
library(readr)    ## for read_file()
library(stringr)  ## for str_match(), str_trim()

raw.text <- read_file("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt")
## Lat/Lon may be on the 2nd or 3rd header line if the station moved, so just
## search the whole text; -? covers stations west of Greenwich (negative Lon)
lat.long <- as.numeric( str_match( raw.text, "Lat (-?\\d+\\.\\d+) Lon (-?\\d+\\.\\d+)" )[,-1] )
## Locate the two-line column header; with R's default (TRE) regex engine
## the . also matches newlines, so this reaches "hours" on the second line
m <- regexpr( " +yyyy.*hours", raw.text )
headertext <- substr( raw.text, m, m+attr(m,"match.length")-1 )
header.lines <- strsplit( headertext, "\\r?\\n" )[[1]]
header.lines <- sub( "^\\s+", "", header.lines )
header.fields2 <- strsplit( header.lines, "\\s+" )
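## the units line has no entries under yyyy and mm, so pad it with two blanks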
header.fields2[[2]] <- c( "", "", header.fields2[[2]] )
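## merge name and unit per column: "tmax" + "degC" -> "tmax degC" etc.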
header.fields <- pmap_chr( header.fields2, paste, collapse=" " ) %>% str_trim
## some cleanup: first take everything after the header block
text.to.read <- substring( raw.text, m+attr(m,"match.length") )
## This next line matches anything that is not a digit (\\d) and not a dash
## (\\-) and not a star (\\*) until the end of the line, $. It's the enclosing
## (?m: ... ) that changes $ to match end of line, and not end of string as usual.
text.to.read2 <- gsub( "(?m:([^\\d\\-\\*]*)$)", "", text.to.read, perl=TRUE )
## by now a simple fread will do the rest for us
d <- fread( text=text.to.read2, fill=TRUE, header=FALSE, na.strings="---" )
setnames(d, header.fields)
d
Output:
      yyyy mm tmax degC tmin degC af days rain mm sun hours
   1: 1914  1       5.2       0.7    <NA>    52.0      <NA>
   2: 1914  2       9.2       3.5    <NA>    28.0      <NA>
   3: 1914  3      <NA>      <NA>    <NA>    <NA>      <NA>
   4: 1914  4      12.9       5.3    <NA>    18.0      <NA>
   5: 1914  5      13.7       7.2    <NA>    38.0      <NA>
  ---
1270: 2020 10     14.2*      9.0*      0*   85.7*     58.8*
1271: 2020 11     12.5*      6.1*      0*   31.9*     73.7*
1272: 2020 12      7.7*      2.9*      6*  105.8*     50.5*
1273: 2021  1      5.8*      1.2*     10*   78.6*     49.4*
1274: 2021  2      7.9*      2.4*      9*   48.6*     84.7*
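If you then want to get back to your original goal of losing the * flags, here is a minimal follow-up sketch, continuing from d and lat.long above. The "est ..." column names are my own invention, and note the # sunshine flags never survive the end-of-line cleanup (# is neither digit, dash nor star), so only * needs handling:
library(data.table)
## everything except the year and month columns can carry flags
flag.cols <- setdiff( names(d), c("yyyy","mm") )
for (col in flag.cols) {
  v <- as.character( d[[col]] )
  ## e.g. "est tmax degC": TRUE where the value carried the "estimated" * flag
  newcol <- paste( "est", col )
  d[ , (newcol) := grepl( "*", v, fixed=TRUE ) ]
  ## strip the flag and convert the measurement to numeric
  d[ , (col) := as.numeric( sub( "*", "", v, fixed=TRUE ) ) ]
}
## attach the station coordinates extracted earlier
d[ , c("lat","lon") := as.list(lat.long) ]
After this, str(d) should show numeric measurement columns plus logical "est" columns, with NA wherever the file had ---.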