Search code examples
rdataframeftpread.tabledata-extraction

Creating data frame with dates and data from a text files


I want to create a data frame from a files. First I have downloaded the ftp files in to my pc from internet. Files can be found in this link: data from ftp. I want to create a data frame with date (taken from the Launch date line of each file) on one column and data on other columns as well.
For grabbing all files I used the following code:

`setwd("C:/Users/")
path = "~C:/Users/"
files <- lapply(list.files(pattern = '\\.l100'), readLines) 
test.sample<-do.call(rbind, lapply(files, function(lines){
  # for each file, return a data.frame of the datetime, pulled with regex
  data.frame(datetime = lubridate::dmy(sub('^.*Launch Date*    : ', '', lines[6])),
             # and the data, read in as text
             read.table(text = lines[13:length(lines)]))
}))`

I am wondering what is wrong with the above code. If you can write a new code would be great too. Thank you in advance.

The data looks like this:

National Oceanic and Atmospheric Administration, U.S. Department of Commerce
Station        : Pago Pago, American Samoa
Station Height : 5 meters
Latitude       : -14.33
Longitude      : -170.71
Flight Number  : ASA016
Launch Date    : 18 July 1986
Launch Time    : 02:31:00 GMT
Radiosonde Type: 
Radiosonde Num : 
O3 Sonde ID    : 4A1834   Background: 0.050 microamps (0.20 mPa)  Flowrate: 29.90 sec/100ml  RH Corr: 2.00 %
Sonde Total O3 : 268   (51) DU      Sonde Total O3 (SBUV): 251 (35) DU

Level   Press    Alt   Pottp   Temp   FtempV   Hum  Ozone  Ozone   Ozone  Ptemp  O3 # DN O3 Res
 Num     hPa      km     K      C       C       %    mPa    ppmv   atmcm    C   10^11/cc   DU
   0  1007.7   0.005   304.0   31.5    26.0     73   1.63  0.016  0.0000   38.4    3.883    268
   1  9999.9   0.100  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
   2   982.0   0.200   301.2   26.5    21.4     74   1.69  0.017  0.0003   38.4    4.096    267
   3  9999.9   0.300  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
   4   960.0   0.400   301.2   24.6    20.6     79   1.64  0.017  0.0006   38.4    3.983    267
   5  9999.9   0.500  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
   6   946.0   0.600   301.1   23.2    20.2     83   1.65  0.017  0.0008   38.4    4.021    267
   7   932.0   0.700   300.9   21.8    19.9     89   1.65  0.018  0.0010   38.5    4.040    267
   8   920.9   0.800   301.1   21.0    16.4     75   1.65  0.018  0.0012   38.5    4.051    266
   9  9999.9   0.900  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
  10  9999.9   1.000  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
  11   893.0   1.100   302.0   19.2    12.1     64   1.58  0.018  0.0016   38.5    3.923    266
  12  9999.9   1.200  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999

I have many files like this. Thank you again.


Solution

  • Though I like the metadata and somewhat self-documentation present in this fixed-width table file, it can be interesting to import them programmatically. Excel happens to do alright since it doesn't complain about lines that don't match the pattern, but that doesn't help import numerous files.

    So we need something a little home-baked. I only downloaded two of the files, but this should work equally well with many more. (I'm assuming the format is consistent, otherwise we'll need to fix some of the magic constants such as 15 (rows to ignore) and 29 (rows that include the table headers).)

    (files <- list.files(pattern = "asa*"))
    # [1] "asa001_1986_04_01_20.l100" "asa002_1986_04_14_18.l100"
    

    This is the workhorse. Instead of "just" taking the date as you requested, it's about the same effort to grab all the metadata headers. They made it a little problematic by putting multiples on some lines (and some labels with a space), but I think I found a suitable pattern to find them.

    alldfs <- lapply(files, function(f) {
      # read in all but the data; two MAGIC CONSTANTS
      txt <- readLines(f, n = 29)[-(1:15)]
      i <- which(! nzchar(trimws(txt)))
      hdrs <- sapply(strsplit(txt[1:(i-1)], ": "), trimws)
      hdrsdf <- as.data.frame(Reduce(c, lapply(hdrs, function(hdr) {
        z <- strsplit(hdr, "  ")
        if (length(z) == 1) {
          setNames(NA_character_, nm = z[[1]])
        } else {
          setNames(lapply(z[-1], head, n=1), nm = trimws(sapply(z[-length(z)], tail, n=1)))
        }
      })), stringsAsFactors = FALSE)
      df <- as.data.frame(lapply(read.table(f, skip=27, stringsAsFactors=F)[-1,],
                                 as.numeric))
      cbind.data.frame(hdrsdf, df)
    })
    length(alldfs)
    # [1] 2
    

    The result is a wide data.frame. The first 13 columns are constant, the metadata headers. If you only want Launch.Date, feel free to drop the rest.

    str(alldfs[[1]])
    # 'data.frame': 361 obs. of  27 variables:
    #  $ Station              : chr  "Pago Pago, American Samoa" "Pago Pago, American Samoa" "Pago Pago, American Samoa" "Pago Pago, American Samoa" ...
    #  $ Station.Height       : chr  "5 meters" "5 meters" "5 meters" "5 meters" ...
    #  $ Latitude             : chr  "-14.33" "-14.33" "-14.33" "-14.33" ...
    #  $ Longitude            : chr  "-170.71" "-170.71" "-170.71" "-170.71" ...
    #  $ Flight.Number        : chr  "ASA001" "ASA001" "ASA001" "ASA001" ...
    #  $ Launch.Date          : chr  "01 April 1986" "01 April 1986" "01 April 1986" "01 April 1986" ...
    #  $ Launch.Time          : chr  "20:40:00 GMT" "20:40:00 GMT" "20:40:00 GMT" "20:40:00 GMT" ...
    #  $ Radiosonde.Type      : chr  NA NA NA NA ...
    #  $ Radiosonde.Num       : chr  NA NA NA NA ...
    #  $ O3.Sonde.ID          : chr  "4A1911" "4A1911" "4A1911" "4A1911" ...
    #  $ Background           : chr  "0.050 microamps (0.19 mPa)" "0.050 microamps (0.19 mPa)" "0.050 microamps (0.19 mPa)" "0.050 microamps (0.19 mPa)" ...
    #  $ Flowrate             : chr  "29.10 sec/100ml" "29.10 sec/100ml" "29.10 sec/100ml" "29.10 sec/100ml" ...
    #  $ RH.Corr              : chr  "2.00 %" "2.00 %" "2.00 %" "2.00 %" ...
    #  $ Sonde.Total.O3       : chr  "242" "242" "242" "242" ...
    #  $ Sonde.Total.O3..SBUV.: chr  "233 (36) DU" "233 (36) DU" "233 (36) DU" "233 (36) DU" ...
    #  $ Level                : num  1010 10000 10000 10000 971 ...
    #  $ Press                : num  0.005 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ...
    #  $ Alt                  : num  302 10000 10000 10000 302 ...
    #  $ Pottp                : num  30.2 999.9 999.9 999.9 25.8 ...
    #  $ Temp                 : num  24.6 999.9 999.9 999.9 20.7 ...
    #  $ FtempV               : num  72 999 999 999 74 999 999 999 75 999 ...
    #  $ Hum                  : num  1.65 99.9 99.9 99.9 1.68 99.9 99.9 99.9 1.48 99.9 ...
    #  $ Ozone                : num  0.016 99.999 99.999 99.999 0.017 ...
    #  $ Ozone.1              : num  0e+00 1e+02 1e+02 1e+02 5e-04 ...
    #  $ Ozone.2              : num  38.1 999.9 999.9 999.9 38.1 ...
    #  $ Ptemp                : num  3.93 1000 1000 1000 4.07 ...
    #  $ O3                   : num  242 9999 9999 9999 241 ...