I am trying to merge >30,000 individual .dat files of climate data across the UK (with sequential names of met*****.dat) into one single data file that I can then extract annual values from. The data is formatted such that the first row in each file contains the lat long for that cell and then underneath that are the climate variables measured:
54.78 -7.06
iGrid x y year month SRad Tmin Tmax vp wind rain sunshine
19028 56 139 1971 1 2.22 1.25 6.43 7.29 12.33 133.47 1.72
19028 56 139 1971 2 3.94 1.26 6.74 7.36 11.11 106.64 1.85
19028 56 139 1971 3 7.30 0.95 7.56 7.05 12.77 57.26 2.57
19028 56 139 1971 4 12.09 2.29 10.07 7.71 11.02 112.81 4.24
19028 56 139 1971 5 15.51 4.85 13.07 9.20 13.94 103.08 5.02
19028 56 139 1971 6 16.18 6.36 14.09 9.98 12.44 115.03 4.76
19028 56 139 1971 7 15.59 9.25 17.56 12.83 10.45 107.18 4.73
19028 56 139 1971 8 11.92 9.20 15.98 12.70 11.39 136.40 3.35
19028 56 139 1971 9 9.69 7.85 16.31 12.72 12.24 41.76 4.11
19028 56 139 1971 10 5.43 6.96 12.88 10.56 16.55 99.72 2.90
19028 56 139 1971 11 2.55 2.78 8.36 8.39 17.98 195.24 1.56
19028 56 139 1971 12 1.51 3.29 8.02 8.68 16.74 54.63 0.89
19028 56 139 1972 1 1.96 -0.30 4.79 6.78 16.55 168.39 1.07
19028 56 139 1972 2 4.23 0.29 5.78 6.97 15.97 114.47 2.33
19028 56 139 1972 3 7.80 0.53 7.54 7.71 14.02 165.09 3.14
19028 56 139 1972 4 12.51 2.80 10.32 8.26 13.39 116.30 4.63
19028 56 139 1972 5 14.49 4.46 11.50 8.98 15.54 187.31 4.17
19028 56 139 1972 6 16.95 5.36 12.90 9.50 12.45 123.17 5.39
19028 56 139 1972 7 15.35 8.61 16.81 12.63 7.41 105.49 4.53
19028 56 139 1972 8 11.40 8.04 15.17 11.66 11.85 86.44 2.88
19028 56 139 1972 9 9.60 5.54 14.03 10.19 9.96 24.70 4.01
19028 56 139 1972 10 5.29 6.04 12.06 10.27 12.92 79.35 2.70
19028 56 139 1972 11 2.49 1.74 7.09 8.02 14.79 169.47 1.43
19028 56 139 1972 12 1.55 1.80 7.45 8.00 17.66 153.31 1.01
19028 56 139 1973 1 1.73 2.16 6.32 7.98 15.25 202.89 0.52
19028 56 139 1973 2 4.08 0.27 5.47 7.17 14.71 126.04 2.07
19028 56 139 1973 3 8.05 0.90 8.30 7.24 13.64 49.84 3.43
19028 56 139 1973 4 12.54 1.98 9.05 7.27 13.02 86.45 4.66
19028 56 139 1973 5 14.55 4.71 12.37 9.41 13.18 102.20 4.22
19028 56 139 1973 6 16.05 7.42 16.17 11.17 10.93 64.45 4.65
19028 56 139 1973 7 13.72 9.66 16.39 12.82 8.16 111.27 3.18
19028 56 139 1973 8 12.23 9.62 17.07 12.93 10.37 97.53 3.62
19028 56 139 1973 9 9.11 7.95 15.02 11.65 12.63 116.15 3.49
19028 56 139 1973 10 5.23 4.51 10.78 9.45 10.27 71.68 2.61
19028 56 139 1973 11 2.37 1.84 7.63 7.91 14.48 162.53 1.17
19028 56 139 1973 12 1.56 1.19 6.79 7.72 14.65 174.89 1.03
Reading previous answers (This and this) has helped me arrive at this code:
library(data.table)
filenames <- list.files(path="/mydirectory",
pattern = ".dat", full.names=T)
temp <- lapply(filenames, fread, sep=",")
data <- rbindlist(temp)
write.table(data,"metcombi.dat", sep=",")
Which spits out a huge .dat file (the input files occupy ~2.5GB) but at the end of each file it repeats the column headings and erases the lat long values.
1) How do I stop it from repeating the column headers for each file I'm merging? 2) How can I extract the lat long values from the first row in each file and move them to new columns?
Thank you!
Consider expanding your lapply()
to do a quick read of first line, read in .dat file into dataframe, and then split lat/long first lines to columns:
filenames <- list.files(path="/mydirectory",
pattern=".dat", full.names=TRUE)
dfList <- lapply(filenames, function(f){
# FIRST LINES
con <- file(description=f, open="r")
latlong <- readLines(con, warn=FALSE)[1]
close(con)
# DATA FRAME
df <- read.table(f, skip=1, header=TRUE)
df$lat <- strsplit(latlong, "\\s+")[[1]][1]
df$lon <- strsplit(latlong, "\\s+")[[1]][2]
return(df)
})
data <- rbindlist(dfList)
write.csv(data, "metcombi.dat")