Search code examples
rdataframedata-extraction

Extract the date and values to create a data frame


Here are three sample data files (text files) from my folder. I tried to extract the date corresponding to each data file and create a data frame for all the three files. Following are the files:

File1:

### DEFAULTS ###
### OPTIONS ###
Title 2 left      $ Date and time: 2010-02-03 12:00 UTC
###  DATA  ### 
250 23.54300 0.00000 12.90000
500 17.47400 3.50000 21.70000
750 41.33200 22.10000 30.40000

File2:

### DEFAULTS ###
### OPTIONS ###
Title 2 left      $ Date and time: 2010-02-10 12:00 UTC 
###  DATA  ###
250 36.95300 30.60000 27.10000
500 40.87700 37.80000 27.80000
750 41.46100 38.30000 30.70000

File3:

### DEFAULTS ###
### OPTIONS ###
Title 2 left      $ Date and time: 2010-02-17 12:00 UTC 
###  DATA  ###
250 28.91200 24.90000 5.00000
500 49.82900 32.40000 11.10000
750 50.83600 40.60000 22.20000

Below is the code I wrote for the extracting the date.

### Start R Code for extracting date
setwd("C:/Users/")
path = "~C:/Users"
file.names<- dir("C:/Users/", pattern =".dat")
file.names

fileNameVector <- NULL
dateNameVector <- NULL

for(i in 1:length(file.names)){

  x<-readLines(file.names[i])
  x
  date.list<-x[3]
  date.list

  xx<-strsplit(date.list,' ')
  xx

  date.fin<-unlist(xx)[13]

  fileNameVector <- rbind(fileNameVector, file.names[i])
  dateNameVector <- rbind(dateNameVector, date.fin)

}

finalDateName <- cbind(fileNameVector, dateNameVector)
finalDateName

and for extracting values:

setwd("C:/Users")
path = "~C:/Users"
st021ozone<-""
file.names<- dir("C:/Users", pattern =".dat")
for(i in 1:length(file.names)){
  file <- read.table(file.names[i],skip=4,header=FALSE, sep=" ", stringsAsFactors=FALSE)
  st021ozone <- rbind(st021ozone, file)
}
write.table(st021ozone, file = "st021ozone",sep=",", 
            row.names = FALSE, qmethod = "double",fileEncoding="windows-1252")

I want to join this two codes and make a complete data frame where the date is on the left side of the table corresponding to the 250 value in each file. The final data frame will be will have 9 rows and 5 columns.

Here is the desired result:

2010-02-03 250 23.54300 0.00000 12.90000
           500 17.47400 3.50000 21.70000
           750 41.33200 22.10000 30.40000
2010-02-10 250 36.95300 30.60000 27.10000
           500 40.87700 37.80000 27.80000
           750 41.46100 38.30000 30.70000
2010-02-17 250 28.91200 24.90000 5.00000
           500 49.82900 32.40000 11.10000
           750 50.83600 40.60000 22.20000

Thank you in advance for the help.

Edit: Here is one sample of the file. I have 53 similar files:

$
### DEFAULTS ###
Color scale       $ 
Contour levels    $ 
Contour colors    $ 
Contour type      $ 
Map limits        $ 
Map projection    $ 
### OPTIONS ###
Image order       $ 1
Title 1 left      $ Case 0236-004 - Vertical 
Vertical 
Title 2 left      $ Date and time: 2010-09-08 
Receptor code: Town021
Title 3 left      $ 
Title 4 left      $ 
Title 1 right     $ 
Title 2 right     $ tranport
Title 3 right     $ Start: 2010-01-01 00:00 UTC
Title 4 right     $   
gar-0236-004-01-20151103085553.png
###  DATA  ###
1 1 0 1 1.83400 32.00000 0.00000 21.20000
1 1 100 1 3.27300 0.00000 25.10000 21.90000
1 1 250 1 12.22200 0.00000 34.30000 25.60000
1 1 500 1 27.27400 0.00000 35.00000 31.30000
1 1 750 1 26.45300 0.00000 36.10000 35.90000
1 1 1000 1 32.62200 0.00000 36.40000 39.30000
1 1 1500 1 35.22700 0.00000 36.70000 42.20000
1 1 2000 1 37.90300 0.00000 37.40000 43.80000
1 1 3000 1 47.28200 0.00000 44.30000 46.90000
1 1 4000 1 51.01200 0.00000 49.00000 49.90000
1 1 5000 1 61.06500 0.00000 49.40000 51.00000

@alistaire Thank you for the link.Here is the code that I tried to execute.

setwd("C:/Users/")
path = "~C:/Users/"
test.sample<-""
files <- lapply(list.files(pattern = '\\.dat'), readLines) 
test.sample<- rbind(test.sample, files)
do.call(rbind, lapply(files, function(lines){
  # for each file, return a data.frame of the datetime, pulled with regex
   data.frame(datetime = as.POSIXct(sub('^.*Date and time: ', '', lines[grep('Date and time:', lines)])),
             # and the data, read in as text
         read.table(text = lines[(grep('DATA', lines) + 1):length(lines)]))
  }))
write.table(test.sample, file = "test.sample", sep="\t", qmethod ="double",row.names = FALSE,fileEncoding="windows-1252")

I assigned a file name and gave a path. When I wrote the variable name 'test.sample' in to console, I got a matrix:see image for the result I think I am missing something??? NB. Here I am using a folder with 44 files.


Solution

  • These cases are mostly a matter of hacking through the mess, though some patterns make life easier. Since the files are at least regular, you don't need too much here to get something useful:

    # read files into a list
    files <- lapply(paste0('File', 1:3, '.dat'), readLines)
    
    # loop across files (into list) with `lapply`, recombine list with `do.call(rbind, ...`
    do.call(rbind, lapply(files, function(lines){
        # for each file, return a data.frame of the datetime, pulled with regex
        data.frame(datetime = as.POSIXct(sub('^.*Date and time: ', '', lines[3])),
                   # and the data, read in as text
                   read.table(text = lines[5:length(lines)]))
    }))
    
    #              datetime  V1     V2   V3   V4
    # 1 2010-02-03 12:00:00 250 23.543  0.0 12.9
    # 2 2010-02-03 12:00:00 500 17.474  3.5 21.7
    # 3 2010-02-03 12:00:00 750 41.332 22.1 30.4
    # 4 2010-02-10 12:00:00 250 36.953 30.6 27.1
    # 5 2010-02-10 12:00:00 500 40.877 37.8 27.8
    # 6 2010-02-10 12:00:00 750 41.461 38.3 30.7
    # 7 2010-02-17 12:00:00 250 28.912 24.9  5.0
    # 8 2010-02-17 12:00:00 500 49.829 32.4 11.1
    # 9 2010-02-17 12:00:00 750 50.836 40.6 22.2
    

    Edit as you see fit.


    Edit

    For the new file structure, now more independent of line numbers:

    do.call(rbind, lapply(files, function(lines){
        # for each file, return a data.frame of the datetime, pulled with regex
        data.frame(datetime = as.POSIXct(sub('^.*Date and time: ', '', lines[grep('Date and time:', lines)])),
                   # and the data, read in as text
                   read.table(text = lines[(grep('DATA', lines) + 1):length(lines)]))
    }))
    
    #      datetime V1 V2   V3 V4     V5 V6   V7   V8
    # 1  2010-09-08  1  1    0  1  1.834 32  0.0 21.2
    # 2  2010-09-08  1  1  100  1  3.273  0 25.1 21.9
    # 3  2010-09-08  1  1  250  1 12.222  0 34.3 25.6
    # 4  2010-09-08  1  1  500  1 27.274  0 35.0 31.3
    # 5  2010-09-08  1  1  750  1 26.453  0 36.1 35.9
    # 6  2010-09-08  1  1 1000  1 32.622  0 36.4 39.3
    # 7  2010-09-08  1  1 1500  1 35.227  0 36.7 42.2
    # 8  2010-09-08  1  1 2000  1 37.903  0 37.4 43.8
    # 9  2010-09-08  1  1 3000  1 47.282  0 44.3 46.9
    # 10 2010-09-08  1  1 4000  1 51.012  0 49.0 49.9
    # 11 2010-09-08  1  1 5000  1 61.065  0 49.4 51.0