I am trying to import data from a large text file that does not have a consistent delimiter between the values. There are anywhere from 1 to about 5 spaces between values, but there are also single spaces within some of the values, which causes those values to be split apart when using sep = "".
I tried to copy and paste some of the data, but the file is so large that I am not sure how it will come across on this platform, so I have also included some pictures of the text file.
Beginning of the data set (many more rows and columns):
Middle of the data set:
No CollarID UTC Date UTC Time LMT Date LMT Time Origin SCTS Date SCTS Time ECEF X ECEF Y ECEF Z Latitude Longitude Height DOP FixType 3D Error Sats Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Sat C/N Mort. Status Activity Main Beacon Temp Easting Northing AnimalID GroupID
m/d/yyyy h:mm:ss AMPM m/d/yyyy h:mm:ss AMPM m/d/yyyy h:mm:ss AMPM [m] [m] [m] [°] [°] [m] [m] used No db No db No db No db No db No db No db No db No db No db No db No db [V] [V] [°C] sskkkmmm.mmm ±kkkkmmm.mmm
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 13642 2/24/2020 5:00:38 PM 2/24/2020 5:00:38 PM Collar 9/1/2022 4:42:09 PM 3789035 911471 5032242 52.4308439 13.5257923 83.00 1.8 val. GPS-3D 3.0 7 2 41 12 44 21 47 25 44 26 41 29 44 31 41 0 0 0 0 0 0 0 0 0 0 normal 0 3.14 3.46 6 33399772.133 5809981.769 N/A N/A
2 13642 2/24/2020 6:00:17 PM 2/24/2020 6:00:17 PM Collar 9/1/2022 4:42:09 PM 3789026 911466 5032229 52.4308433 13.5257518 66.65 4.4 val. GPS-3D 7.0 5 21 47 25 44 26 41 29 41 31 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.20 3.46 5 33399769.376 5809981.762 N/A N/A
3 13642 2/24/2020 7:00:39 PM 2/24/2020 7:00:39 PM Collar 9/1/2022 4:42:09 PM 3789033 911470 5032242 52.4308594 13.5257849 81.67 1.4 val. GPS-3D 3.2 8 5 41 16 41 20 47 21 41 26 44 27 38 29 44 31 29 0 0 0 0 0 0 0 0 normal 0 3.18 3.42 4 33399771.664 5809983.505 N/A N/A
4 13642 2/24/2020 8:00:17 PM 2/24/2020 8:00:17 PM Collar 9/1/2022 4:42:09 PM 3789038 911470 5032239 52.4308083 13.5257677 82.26 2.6 val. GPS-3D 7.0 6 16 44 20 50 21 41 26 44 27 44 29 32 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.18 3.44 4 33399770.379 5809977.850 N/A N/A
5 13642 2/24/2020 9:00:09 PM 2/24/2020 9:00:09 PM Collar 9/1/2022 4:42:09 PM 3789032 911468 5032235 52.4308313 13.5257597 75.24 5.0 val. GPS-3D 7.4 5 16 44 20 44 21 41 26 41 27 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.22 3.46 4 33399769.890 5809980.415 N/A N/A
6 13642 2/24/2020 10:00:38 PM 2/24/2020 10:00:38 PM Collar 9/1/2022 4:42:09 PM 3789041 911468 5032243 52.4308128 13.5257288 86.92 1.2 val. GPS-3D 3.4 9 1 35 8 44 10 47 11 44 16 44 20 41 21 35 27 44 30 38 0 0 0 0 0 0 normal 0 3.20 3.40 5 33399767.744 5809978.401 N/A N/A
7 13642 2/24/2020 11:00:17 PM 2/24/2020 11:00:17 PM Collar 9/1/2022 4:42:09 PM 3789032 911467 5032229 52.4308001 13.5257454 70.34 2.2 val. GPS-3D 4.4 6 1 38 8 41 10 44 11 44 20 35 27 47 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.20 3.42 6 33399768.848 5809976.964 N/A N/A
8 13642 2/25/2020 12:00:08 AM 2/25/2020 12:00:08 AM Collar 9/1/2022 4:42:09 PM 3789035 911467 5032237 52.4308231 13.5257351 78.46 2.8 val. GPS-3D 5.2 5 1 44 8 44 10 44 11 38 27 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.22 3.48 7 33399768.199 5809979.543 N/A N/A
9 13642 2/25/2020 1:00:39 AM 2/25/2020 1:00:39 AM Collar 9/1/2022 4:42:09 PM 3789033 911470 5032236 52.4308265 13.5257849 76.91 1.6 val. GPS-3D 2.4 8 1 44 3 44 8 41 11 47 14 44 17 38 22 44 28 41 0 0 0 0 0 0 0 0 normal 0 3.22 3.44 8 33399771.589 5809979.849 N/A N/A
10 13642 2/25/2020 2:00:16 AM 2/25/2020 2:00:16 AM Collar 9/1/2022 4:42:09 PM 3789034 911470 5032239 52.4308360 13.5257814 79.88 1.8 val. GPS-3D 3.8 7 1 44 3 44 11 47 14 41 17 41 19 38 22 44 0 0 0 0 0 0 0 0 0 0 normal 0 3.24 3.48 8 33399771.377 5809980.912 N/A N/A
11 13642 2/25/2020 3:00:09 AM 2/25/2020 3:00:09 AM Collar 9/1/2022 4:42:09 PM 3789030 911470 5032236 52.4308473 13.5257952 75.14 2.2 val. GPS-3D 6.6 6 1 44 3 44 14 32 17 44 19 41 22 44 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.24 3.46 8 33399772.337 5809982.145 N/A N/A
12 13642 2/25/2020 4:00:38 AM 2/25/2020 4:00:38 AM Collar 9/1/2022 4:42:09 PM 3789036 911469 5032236 52.4308074 13.5257603 78.55 1.6 val. GPS-3D 3.2 8 1 38 2 38 6 44 9 44 19 44 22 47 23 44 31 38 0 0 0 0 0 0 0 0 normal 0 3.22 3.44 7 33399769.873 5809977.757 N/A N/A
13 13642 2/25/2020 4:54:41 AM 2/25/2020 4:54:41 AM Collar 9/1/2022 4:42:09 PM 3789043 911469 5032240 52.4307808 13.5257362 85.87 2.4 val. GPS-3D 4.6 7 2 41 3 47 4 38 6 44 9 44 22 35 23 41 0 0 0 0 0 0 0 0 0 0 Mortality no radius 0 3.22 3.46 7 33399768.176 5809974.837 N/A N/A
14 13642 2/25/2020 5:21:45 AM 2/25/2020 5:21:45 AM Collar 9/1/2022 4:42:09 PM 3789044 911474 5032243 52.4307820 13.5258042 89.55 2.8 val. GPS-3D 5.6 6 2 44 3 44 4 41 6 44 9 44 23 41 0 0 0 0 0 0 0 0 0 0 0 0 Mortality no radius 0 3.22 3.48 7 33399772.804 5809974.874 N/A N/A
44 13642 2/26/2020 9:00:16 AM 2/26/2020 9:00:16 AM Collar 9/1/2022 4:42:09 PM 3789037 911470 5032239 52.4308152 13.5257711 81.66 2.2 val. GPS-3D 4.6 7 5 47 7 44 13 44 15 41 27 29 28 47 30 41 0 0 0 0 0 0 0 0 0 0 Mortality no radius 0 3.22 3.44 4 33399770.628 5809978.615 N/A N/A
45 13642 2/26/2020 10:00:09 AM 2/26/2020 10:00:09 AM Collar 9/1/2022 4:42:09 PM 3789031 911468 5032232 52.4308218 13.5257632 72.27 3.4 val. GPS-3D 5.4 5 5 38 13 44 15 44 28 44 30 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Mortality no radius 0 3.22 3.44 3 33399770.102 5809979.352 N/A N/A
46 13642 3/10/2020 9:01:37 PM 3/10/2020 9:01:37 PM Collar 9/1/2022 4:42:09 PM N/A N/A N/A N/A N/A N/A 48.6 No Fix N/A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.22 3.50 19 31166021.445 0.000 N/A N/A
47 13642 3/11/2020 1:01:37 AM 3/11/2020 1:01:37 AM Collar 9/1/2022 4:42:09 PM N/A N/A N/A N/A N/A N/A 48.6 No Fix N/A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.22 3.52 19 31166021.445 0.000 N/A N/A
48 13642 3/11/2020 3:01:37 AM 3/11/2020 3:01:37 AM Collar 9/1/2022 4:42:09 PM N/A N/A N/A N/A N/A N/A 48.6 No Fix N/A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.22 3.52 19 31166021.445 0.000 N/A N/A
I tried the following code, which I think would have worked if there weren't places where values are separated by only a single space (see the pic of the middle of the data set):
headers <- read.table(text = gsub("\\s{2,}", "\t", readLines(file_name)), sep="\t", header = FALSE, nrows = 1, as.is = TRUE)
Vec_data <- read.table(text = gsub("\\s{2,}", "\t", readLines(file_name)), sep="\t", skip = 3, header = FALSE)
colnames(Vec_data) <- unlist(headers)
That's a "fixed width format" (FWF) file, so we need read.fwf, and we need to determine each column's width.
I took one of the lines and counted to the end of each value. For instance:
4 13642 2/24/2020 8:00:17 PM 2/24/2020 8:00:17 PM Collar 9/1/2022 4:42:09 PM 3789038 911470 5032239 52.4308083 13.5257677 82.26 2.6 val. GPS-3D 7.0 6 16 44 20 50 21 41 26 44 27 44 29 32 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.18 3.44 4 33399770.379 5809977.850 N/A N/A
^^^^^^^^ 8 ^^^^^^^^^^^ 11
^^^^^^^^^ 9 ^^^^^^^^^^^^^ 13
(etc).
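Counting by hand is tedious. As a rough sketch, the widths can be inferred from one well-populated row by treating runs of two or more spaces as column gaps; the helper below is hypothetical, and widths taken from a single line should be checked against several other rows before trusting them.

```r
# Sketch: infer fixed widths from one representative data line by treating
# runs of >= 2 spaces as column gaps. Widths from a single row may not
# generalize, so verify the result against several other rows.
infer_widths <- function(line) {
  m <- gregexpr(" {2,}", line)[[1]]
  ends <- m + attr(m, "match.length") - 1  # last character of each gap
  ends <- ends[m > 1]                      # a leading gap pads the first field
  diff(c(0L, ends, nchar(line)))           # widths include trailing spaces
}
line <- "   1  13642  2/24/2020 5:00:38 PM  Collar"
infer_widths(line)
# widths: 6 7 22 6 (sums to nchar(line))
```

Note that the third field spans "2/24/2020 5:00:38 PM" as a single 22-character column even though it contains single spaces, which is exactly the behavior that defeats whitespace-delimited parsing.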
There are two ways to do this. First, we can readLines() the whole file and then drop the second and third rows (the units line and the dashed separator) with [-(2:3)]. If the file is huge, this may be less efficient (given R's global string hash), but if not then this is likely the easiest way to go.
P <- "path/to/file.dat"
widths <- c(8L, 9L, 11L, 13L, 11L, 13L, 13L, 11L, 13L, 10L, 10L, 10L, 13L,
13L, 9L, 5L, 14L, 9L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 28L,
9L, 13L, 6L, 21L, 21L, 9L, 9L)
quux <- read.fwf(textConnection(readLines(P)[-(2:3)]), widths = widths, header = TRUE)
# Warning in readLines(P) :
# incomplete final line found on '~/StackOverflow/23232022/quux.dat'
# Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names, :
# more columns than column names
Unfortunately, it does not work with this file :-(
(Note: sometimes this error means that one or more of the column widths are off by one or so; a "+1" in one width and a "-1" in another may fix it. I did not pursue that here.)
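Before hunting for off-by-one widths, a quick sanity check (a sketch, using toy widths and lines) is to compare the sum of the widths against the actual line lengths; any data line longer than sum(widths) would have its trailing fields silently truncated by read.fwf.

```r
# Sketch: verify a candidate `widths` vector against the actual line
# lengths; rows longer than sum(widths) would lose their trailing fields.
widths <- c(6L, 7L, 22L, 6L)  # toy example widths, not the real file's
lines  <- c("   1  13642  2/24/2020 5:00:38 PM  Collar",
            "   2  13642  2/24/2020 6:00:17 PM  Collar")
stopifnot(max(nchar(lines)) <= sum(widths))
```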
Sometimes with FWF files, the column names don't align perfectly with the data columns. In that case (the second way), we skip the header rows, read the data with header = FALSE, and then either find another set of widths for the names or set them manually.
quux <- read.fwf(P, widths = widths, skip = 3, header = FALSE)
head(quux, 3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 1 13642 2/24/2020 5:00:38 PM 2/24/2020 5:00:38 PM Collar 9/1/2022 4:42:09 PM 3789035 911471 5032242 52.4308439
# 2 2 13642 2/24/2020 6:00:17 PM 2/24/2020 6:00:17 PM Collar 9/1/2022 4:42:09 PM 3789026 911466 5032229 52.4308433
# 3 3 13642 2/24/2020 7:00:39 PM 2/24/2020 7:00:39 PM Collar 9/1/2022 4:42:09 PM 3789033 911470 5032242 52.4308594
# V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43
# 1 13.5257923 83.00 1.8 val. GPS-3D 3.0 7 2 41 12 44 21 47 25 44 26 41 29 44 31 41 0 0 0 0 0 0 0 0 0 0
# 2 13.5257518 66.65 4.4 val. GPS-3D 7.0 5 21 47 25 44 26 41 29 41 31 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 13.5257849 81.67 1.4 val. GPS-3D 3.2 8 5 41 16 41 20 47 21 41 26 44 27 38 29 44 31 29 0 0 0 0 0 0 0 0
# V44 V45 V46 V47 V48 V49 V50 V51 V52
# 1 normal 0 3.14 3.46 6 33399772 5809982 N/A N/A
# 2 normal 0 3.20 3.46 5 33399769 5809982 N/A N/A
# 3 normal 0 3.18 3.42 4 33399772 5809984 N/A N/A
For the names, oddly enough, we can read them in with the same widths:
hdrs <- read.fwf(P, widths = widths, nrows = 1, header = FALSE)
trimws(unlist(hdrs, use.names = FALSE))
# [1] "No" "CollarID" "UTC Date" "UTC Time" "LMT Date" "LMT Time" "Origin" "SCTS Date" "SCTS Time" "ECEF X" "ECEF Y" "ECEF Z" "Latitude" "Longitude" "Height" "DOP" "FixType" "3D Error" "Sats" "Sat" "C/N"
# [22] "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat" "C/N" "Sat"
# [43] "C/N" "Mort. Status" "Activity" "Main Beacon" "Temp" "Easting" "Northing" "AnimalID" "GroupID"
Unfortunately, note that many of them repeat ("Sat" and "C/N"), so you'd need to find a way to make them unique anyway.
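Base R's make.unique() (or make.names(..., unique = TRUE) if you also want syntactic names) is one way to de-duplicate them:

```r
# De-duplicating repeated header names with base R. make.unique() appends
# .1, .2, ... to repeats; make.names() additionally makes them syntactic.
nms <- c("Sats", "Sat", "C/N", "Sat", "C/N", "Mort. Status")
make.unique(nms)
# "Sats" "Sat" "C/N" "Sat.1" "C/N.1" "Mort. Status"
make.names(nms, unique = TRUE)
# "Sats" "Sat" "C.N" "Sat.1" "C.N.1" "Mort..Status"
```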
After that, I suggest some basic cleanup: removing leading spaces (most fields), converting "N/A" to R's NA, and then converting the columns into classes R knows about (since any column that contains "N/A" will be read as character even if it really is numeric).
quux[] <- lapply(quux, function(z) {
  if (is.character(z)) z <- trimws(z)   # strip fixed-width padding
  z <- ifelse(z == "N/A", z[NA], z)     # literal "N/A" -> real NA
  type.convert(z, as.is = TRUE)         # re-class numeric columns
})
head(quux, 3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51
# 1 1 13642 2/24/2020 5:00:38 PM 2/24/2020 5:00:38 PM Collar 9/1/2022 4:42:09 PM 3789035 911471 5032242 52.43084 13.52579 83.00 1.8 val. GPS-3D 3.0 7 2 41 12 44 21 47 25 44 26 41 29 44 31 41 0 0 0 0 0 0 0 0 0 0 normal 0 3.14 3.46 6 33399772 5809982 NA NA
# 2 2 13642 2/24/2020 6:00:17 PM 2/24/2020 6:00:17 PM Collar 9/1/2022 4:42:09 PM 3789026 911466 5032229 52.43084 13.52575 66.65 4.4 val. GPS-3D 7.0 5 21 47 25 44 26 41 29 41 31 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 normal 0 3.20 3.46 5 33399769 5809982 NA NA
# 3 3 13642 2/24/2020 7:00:39 PM 2/24/2020 7:00:39 PM Collar 9/1/2022 4:42:09 PM 3789033 911470 5032242 52.43086 13.52578 81.67 1.4 val. GPS-3D 3.2 8 5 41 16 41 20 47 21 41 26 44 27 38 29 44 31 29 0 0 0 0 0 0 0 0 normal 0 3.18 3.42 4 33399772 5809984 NA NA
(The quux[] <- lapply(...) idiom replaces every column while keeping quux a data.frame, instead of collapsing it into a plain list.)
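A minimal illustration of why the empty-bracket assignment matters:

```r
# `df[] <- lapply(df, f)` replaces every column but keeps the data.frame
# class and attributes; `df <- lapply(df, f)` would return a plain list.
df <- data.frame(a = c(" 1 ", "N/A"), b = c(" x", "y "))
df[] <- lapply(df, trimws)
class(df)   # "data.frame"
df$a        # "1" "N/A"
```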
From here, you'll have the remaining challenge of converting the date and time columns to POSIXt.
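As a sketch of that conversion: paste the date and time fields together and hand as.POSIXct() an AM/PM format string. The vectors below stand in for the cleaned columns (their names are hypothetical), and note that %p parsing depends on the locale understanding AM/PM (the "C" locale does).

```r
# Sketch: convert m/d/yyyy dates plus h:mm:ss AM/PM times to POSIXct.
# %I is the 12-hour clock and %p the AM/PM marker; %p is locale-dependent.
d  <- c("2/24/2020", "2/25/2020")   # stand-ins for a cleaned date column
tm <- c("5:00:38 PM", "12:00:08 AM")
ts <- as.POSIXct(paste(d, tm), format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
format(ts, "%Y-%m-%d %H:%M:%S")
# "2020-02-24 17:00:38" "2020-02-25 00:00:08"
```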