I am working with files that begin with variable-length text followed by binary data, specifically the so-called "Table Oriented Binary" (TOB) format.
Simply put, the text acts as the header of the binary table. It contains descriptive column names as well as data types, which define the length of each value in the columns.
Since the binary data has no delimiters, these lengths are used to read each data point one after the other. (More details: https://s.campbellsci.com/documents/us/manuals/loggernet34.pdf, TOB1 section; there is also a small example below.)
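As a rough illustration of how the type row determines the record layout, the sketch below maps type names to byte widths. Only the 4-byte width of ULONG is stated in this post; the widths for IEEE4 and ASCII(n) are assumptions inferred from their names, and `tob1_width` is a hypothetical helper, so the manual should be checked before relying on it.

```r
# Hypothetical helper: byte width of one TOB1 column type.
# Assumptions: IEEE4 is a 4-byte float and ASCII(n) occupies n bytes.
# Any type not handled here returns NA and must be looked up in the manual.
tob1_width <- function(type) {
  if (type %in% c("ULONG", "IEEE4")) return(4L)
  m <- regmatches(type, regexec("^ASCII\\(([0-9]+)\\)$", type))[[1]]
  if (length(m) == 2L) return(as.integer(m[2]))
  NA_integer_
}

types  <- c("ULONG", "ULONG", "IEEE4", "ASCII(16)")
widths <- vapply(types, tob1_width, integer(1), USE.NAMES = FALSE)
record_length <- sum(widths)  # bytes per record for these four columns
```

Returning NA for unknown types is deliberate: a wrong width would silently shift every later value in the record.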
I came across an old repository (https://github.com/mlt/csdf.git) where the import is implemented (read.tob1.R) and started my approach from there, because I also want to read TOB3 files, which are similar but not implemented there.
The import first opens the file in text mode
file.text <- file(file,"r")
to read the metainfo and the header with something like
header <- read.csv(file.text, header = FALSE, nrows = 4)
While the file is read, the byte position advances each time. After the header lines are read, the byte position is obtained with
pos <- seek(file.text)
This position is then assumed to be the end of the header and the start of the binary data. The file is then opened again in binary mode and set to that position
file.bin <- file(file,"rb")
seek(file.bin,pos)
and is read from this position using dedicated functions for each data type defined in the header. For example, the first value is of type ULONG, which is 4 bytes long, so the next 4 bytes are read while the position advances automatically.
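The sequential reading described here can be sketched with readBin(), which advances the connection by the number of bytes it consumes. This is only a sketch: `read_ulong` is a hypothetical helper, little-endian byte order is an assumption, and the temporary file is a stand-in for real data. Because readBin() supports unsigned reads only for sizes 1 and 2, the 4-byte value is assembled from two unsigned 16-bit halves.

```r
# Hypothetical helper: read n unsigned 32-bit integers from a binary connection.
# Each value is built from two unsigned 16-bit halves (little-endian assumed).
read_ulong <- function(con, n = 1L) {
  halves <- readBin(con, "integer", n = 2L * n, size = 2L,
                    signed = FALSE, endian = "little")
  lo <- halves[c(TRUE, FALSE)]
  hi <- halves[c(FALSE, TRUE)]
  lo + hi * 65536  # double, so values above 2^31 - 1 are representable
}

# Stand-in file: one ULONG (65537) followed by one 4-byte float (1.5).
tmp <- tempfile()
out <- file(tmp, "wb")
writeBin(as.raw(c(0x01, 0x00, 0x01, 0x00)), out)
writeBin(1.5, out, size = 4L, endian = "little")
close(out)

con <- file(tmp, "rb")
u   <- read_ulong(con)                                        # first 4 bytes
f   <- readBin(con, "numeric", size = 4L, endian = "little")  # next 4 bytes
pos <- seek(con)                                              # bytes consumed so far
close(con)
```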
All well and good... but for some reason, when reading with read.csv(), the byte position advances further than the lengths of the rows would suggest.
If I read the header lines from the text-mode connection, I end up at byte position 3654, but when I read the same lines with readLines() from the binary connection, it ends up at 3152.
Most of the offset seems to arise when reading the first line, and the offset then shrinks by one byte with each further line (see dump below). From a hex editor and from manually seeking the position in the file, I know that the correct start position of the binary data is in fact 3152.
Where does this difference come from, and is there another way to find the start of the binary data? (I am on Windows 10.)
> seek(testfile.text,0)
[1] 4096
> seek(testfile.bin,0)
[1] 95
> read.csv(testfile.text,header = FALSE,nrows = 1)
V1 V2 V3 V4 V5 V6 V7 V8
1 TOB1 Tower CR6 7562 CR6.Std.10.02 CPU:EasyFlux_Tower.cr6 22445 Flux_CSIFormat
> readLines(testfile.bin,n=1)
[1] "\"TOB1\",\"Tower\",\"CR6\",\"7562\",\"CR6.Std.10.02\",\"CPU:EasyFlux_Tower.cr6\",\"22445\",\"Flux_CSIFormat\""
> seek(testfile.text)
[1] 601
> seek(testfile.bin)
[1] 95
> 601-95
[1] 506
> read.csv(testfile.text,header = FALSE,nrows = 1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 SECONDS NANOSECONDS RECORD FC_mass FC_QC FC_samples LE LE_QC LE_samples H H_QC H_samples
V13 V14 V15 V16 V17 V18 V19 V20 V21 V22
1 NETRAD G SG energy_closure poor_energy_closure_flg Bowen_ratio TAU TAU_QC USTAR TSTAR
V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33
1 TKE TA_1_1_1 RH_1_1_1 T_DP_1_1_1 amb_e amb_e_sat TA_2_1_1 RH_2_1_1 T_DP_2_1_1 e e_sat
V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44
1 TA_3_1_1 RH_3_1_1 T_DP_3_1_1 e_probe e_sat_probe H2O_probe PA VPD Ux Ux_SIGMA Uy
V45 V46 V47 V48 V49 V50 V51 V52 V53 V54
1 Uy_SIGMA Uz Uz_SIGMA T_SONIC T_SONIC_SIGMA sonic_azimuth WS WS_RSLT WD_SONIC WD_SIGMA
V55 V56 V57 V58 V59 V60 V61
1 WD WS_MAX CO2_density CO2_density_SIGMA H2O_density H2O_density_SIGMA CO2_sig_strgth_Min
V62 V63 V64 V65 V66 V67 V68 V69 V70 V71
1 H2O_sig_strgth_Min P ALB SW_IN SW_OUT LW_IN LW_OUT T_nr_Avg LW_IN_meas LW_OUT_meas
V72 V73 V74 V75 V76 V77 V78
1 PPFD_IN sun_azimuth sun_elevation hour_angle sun_declination air_mass_coeff daytime
V79 V80 V81 V82 V83 V84 V85 V86
1 TS_1_1_1 SWC_1_1_1 cs65x_ec_1_1_1 G_plate_1_1_1 shfp_cal_1_1_1 FETCH_MAX FETCH_90 FETCH_55
V87 V88 V89 V90
1 FETCH_40 UPWND_DIST_INTRST FP_DIST_INTRST FP_EQUATION
> readLines(testfile.bin,n=1)
[1] "\"SECONDS\",\"NANOSECONDS\",\"RECORD\",\"FC_mass\",\"FC_QC\",\"FC_samples\",\"LE\",\"LE_QC\",\"LE_samples\",\"H\",\"H_QC\",\"H_samples\",\"NETRAD\",\"G\",\"SG\",\"energy_closure\",\"poor_energy_closure_flg\",\"Bowen_ratio\",\"TAU\",\"TAU_QC\",\"USTAR\",\"TSTAR\",\"TKE\",\"TA_1_1_1\",\"RH_1_1_1\",\"T_DP_1_1_1\",\"amb_e\",\"amb_e_sat\",\"TA_2_1_1\",\"RH_2_1_1\",\"T_DP_2_1_1\",\"e\",\"e_sat\",\"TA_3_1_1\",\"RH_3_1_1\",\"T_DP_3_1_1\",\"e_probe\",\"e_sat_probe\",\"H2O_probe\",\"PA\",\"VPD\",\"Ux\",\"Ux_SIGMA\",\"Uy\",\"Uy_SIGMA\",\"Uz\",\"Uz_SIGMA\",\"T_SONIC\",\"T_SONIC_SIGMA\",\"sonic_azimuth\",\"WS\",\"WS_RSLT\",\"WD_SONIC\",\"WD_SIGMA\",\"WD\",\"WS_MAX\",\"CO2_density\",\"CO2_density_SIGMA\",\"H2O_density\",\"H2O_density_SIGMA\",\"CO2_sig_strgth_Min\",\"H2O_sig_strgth_Min\",\"P\",\"ALB\",\"SW_IN\",\"SW_OUT\",\"LW_IN\",\"LW_OUT\",\"T_nr_Avg\",\"LW_IN_meas\",\"LW_OUT_meas\",\"PPFD_IN\",\"sun_azimuth\",\"sun_elevation\",\"hour_angle\",\"sun_declination\",\"air_mass_coeff\",\"daytime\",\"TS_1_1_1\",\"SWC_1_1_1\",\"cs65x_ec_1_1_1\",\"G_plate_1_1_1\",\"shfp_cal_1_1_1\",\"FETCH_MAX\",\"FETCH_90\",\"FETCH_55\",\"FETCH_40\",\"UPWND_DIST_INTRST\",\"FP_DIST_INTRST\",\"FP_EQUATION\""
> seek(testfile.text)
[1] 1610
> seek(testfile.bin)
[1] 1105
> 1610-1105
[1] 505
> read.csv(testfile.text,header = FALSE,nrows = 1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 SECONDS NANOSECONDS RN mg m-2 s-1 Grade samples W m-2 Grade samples W m-2 grade samples
V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
1 W m-2 W m-2 W m-2 fraction NA fraction kg m-1 s-2 grade m s-1 deg C m2 s-2 deg C %
V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43
1 deg C kPa kPa deg C % deg C kPa kPa deg C % deg C kPa kPa g/m^3 kPa hPa m s-1 m s-1
V44 V45 V46 V47 V48 V49 V50 V51 V52 V53
1 m s-1 m s-1 m s-1 m s-1 deg C deg C Decimal degrees m s-1 m s-1 decimal degrees
V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64
1 decimal degrees decimal degrees m s-1 mg m-3 mg m-3 g m-3 g m-3 fraction fraction mm %
V65 V66 V67 V68 V69 V70 V71 V72 V73
1 W m-2 W m-2 W m-2 W m-2 Klvin W m-2 W m-22 umolPhoton m-2 s-1 decimal degrees
V74 V75 V76 V77 V78 V79 V80 V81
1 decimal degrees decimal degrees decimal degrees adimensional fraction deg C % dS m-1
V82 V83 V84 V85 V86 V87 V88 V89 V90
1 W m-2 NA m m m m m % authors
> readLines(testfile.bin,n=1)
[1] "\"SECONDS\",\"NANOSECONDS\",\"RN\",\"mg m-2 s-1\",\"Grade\",\"samples\",\"W m-2\",\"Grade\",\"samples\",\"W m-2\",\"grade\",\"samples\",\"W m-2\",\"W m-2\",\"W m-2\",\"fraction\",\"\",\"fraction\",\"kg m-1 s-2\",\"grade\",\"m s-1\",\"deg C\",\"m2 s-2\",\"deg C\",\"%\",\"deg C\",\"kPa\",\"kPa\",\"deg C\",\"%\",\"deg C\",\"kPa\",\"kPa\",\"deg C\",\"%\",\"deg C\",\"kPa\",\"kPa\",\"g/m^3\",\"kPa\",\"hPa\",\"m s-1\",\"m s-1\",\"m s-1\",\"m s-1\",\"m s-1\",\"m s-1\",\"deg C\",\"deg C\",\"Decimal degrees\",\"m s-1\",\"m s-1\",\"decimal degrees\",\"decimal degrees\",\"decimal degrees\",\"m s-1\",\"mg m-3\",\"mg m-3\",\"g m-3\",\"g m-3\",\"fraction\",\"fraction\",\"mm\",\"%\",\"W m-2\",\"W m-2\",\"W m-2\",\"W m-2\",\"Klvin\",\"W m-2\",\"W m-22\",\"umolPhoton m-2 s-1\",\"decimal degrees\",\"decimal degrees\",\"decimal degrees\",\"decimal degrees\",\"adimensional\",\"fraction\",\"deg C\",\"%\",\"dS m-1\",\"W m-2\",\"\",\"m\",\"m\",\"m\",\"m\",\"m\",\"%\",\"authors\""
> seek(testfile.text)
[1] 2400
> seek(testfile.bin)
[1] 1896
> 2400-1896
[1] 504
> read.csv(testfile.text,header = FALSE,nrows = 1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 NA NA NA Smp Smp Tot Smp Smp Tot Smp Smp Tot Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp
V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45
1 Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp
V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67
1 Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Smp Min Min Tot Smp Avg Smp Smp
V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89
1 Smp Smp Avg Avg Avg Avg Avg Avg Avg Avg Tot Avg Avg Avg Smp Smp Smp Smp Smp Smp Smp Smp
V90
1 Smp
> readLines(testfile.bin,n=1)
[1] "\"\",\"\",\"\",\"Smp\",\"Smp\",\"Tot\",\"Smp\",\"Smp\",\"Tot\",\"Smp\",\"Smp\",\"Tot\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Min\",\"Min\",\"Tot\",\"Smp\",\"Avg\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Avg\",\"Avg\",\"Avg\",\"Avg\",\"Avg\",\"Avg\",\"Avg\",\"Avg\",\"Tot\",\"Avg\",\"Avg\",\"Avg\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\",\"Smp\""
> seek(testfile.text)
[1] 2931
> seek(testfile.bin)
[1] 2428
> 2931-2428
[1] 503
> read.csv(testfile.text,header = FALSE,nrows = 1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 ULONG ULONG ULONG IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4
V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30
1 IEEE4 BOOL IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4
V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45
1 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4
V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60
1 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4
V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75
1 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4
V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89
1 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4 IEEE4
V90
1 ASCII(16)
> readLines(testfile.bin,n=1)
[1] "\"ULONG\",\"ULONG\",\"ULONG\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"BOOL\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"IEEE4\",\"ASCII(16)\""
> seek(testfile.text)
[1] 3654
> seek(testfile.bin)
[1] 3152
> 3654-3152
[1] 502
The same problem occurs when using readLines() on the text-mode connection:
> readLines(testfile.text,n=1)
[1] "\"TOB1\",\"Tower\",\"CR6\",\"7562\",\"CR6.Std.10.02\",\"CPU:EasyFlux_Tower.cr6\",\"22445\",\"Flux_CSIFormat\""
> readLines(testfile.bin,n=1)
[1] "\"TOB1\",\"Tower\",\"CR6\",\"7562\",\"CR6.Std.10.02\",\"CPU:EasyFlux_Tower.cr6\",\"22445\",\"Flux_CSIFormat\""
> seek(testfile.text)
[1] 601
> seek(testfile.bin)
[1] 95
Header and first data row when the file is opened in a text editor:
"TOB1","Tower","CR6","7562","CR6.Std.10.02","CPU:EasyFlux_Tower.cr6","22445","Flux_CSIFormat"
"SECONDS","NANOSECONDS","RECORD","FC_mass","FC_QC","FC_samples","LE","LE_QC","LE_samples","H","H_QC","H_samples","NETRAD","G","SG","energy_closure","poor_energy_closure_flg","Bowen_ratio","TAU","TAU_QC","USTAR","TSTAR","TKE","TA_1_1_1","RH_1_1_1","T_DP_1_1_1","amb_e","amb_e_sat","TA_2_1_1","RH_2_1_1","T_DP_2_1_1","e","e_sat","TA_3_1_1","RH_3_1_1","T_DP_3_1_1","e_probe","e_sat_probe","H2O_probe","PA","VPD","Ux","Ux_SIGMA","Uy","Uy_SIGMA","Uz","Uz_SIGMA","T_SONIC","T_SONIC_SIGMA","sonic_azimuth","WS","WS_RSLT","WD_SONIC","WD_SIGMA","WD","WS_MAX","CO2_density","CO2_density_SIGMA","H2O_density","H2O_density_SIGMA","CO2_sig_strgth_Min","H2O_sig_strgth_Min","P","ALB","SW_IN","SW_OUT","LW_IN","LW_OUT","T_nr_Avg","LW_IN_meas","LW_OUT_meas","PPFD_IN","sun_azimuth","sun_elevation","hour_angle","sun_declination","air_mass_coeff","daytime","TS_1_1_1","SWC_1_1_1","cs65x_ec_1_1_1","G_plate_1_1_1","shfp_cal_1_1_1","FETCH_MAX","FETCH_90","FETCH_55","FETCH_40","UPWND_DIST_INTRST","FP_DIST_INTRST","FP_EQUATION"
"SECONDS","NANOSECONDS","RN","mg m-2 s-1","Grade","samples","W m-2","Grade","samples","W m-2","grade","samples","W m-2","W m-2","W m-2","fraction","","fraction","kg m-1 s-2","grade","m s-1","deg C","m2 s-2","deg C","%","deg C","kPa","kPa","deg C","%","deg C","kPa","kPa","deg C","%","deg C","kPa","kPa","g/m^3","kPa","hPa","m s-1","m s-1","m s-1","m s-1","m s-1","m s-1","deg C","deg C","Decimal degrees","m s-1","m s-1","decimal degrees","decimal degrees","decimal degrees","m s-1","mg m-3","mg m-3","g m-3","g m-3","fraction","fraction","mm","%","W m-2","W m-2","W m-2","W m-2","Klvin","W m-2","W m-22","umolPhoton m-2 s-1","decimal degrees","decimal degrees","decimal degrees","decimal degrees","adimensional","fraction","deg C","%","dS m-1","W m-2","","m","m","m","m","m","%","authors"
"","","","Smp","Smp","Tot","Smp","Smp","Tot","Smp","Smp","Tot","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Min","Min","Tot","Smp","Avg","Smp","Smp","Smp","Smp","Avg","Avg","Avg","Avg","Avg","Avg","Avg","Avg","Tot","Avg","Avg","Avg","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp"
"ULONG","ULONG","ULONG","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","BOOL","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","IEEE4","ASCII(16)"
|·< ç Γf¾ A ŒFÓÁ@À A ŒFL> A ŒFi3]Â=dÖ¿Ýè¿[øP= 𧻽ǩV= À@o>ŽzŸºoçD?\ÇqAõ.CB…•@¢JW?r¢Ü?ûÞ’Aõš BƳ’@‚·Y?ÿ™@¨pAdÁB"ÒgApîÓ?:§Û?"/GA䔡BLîJA§`á>ŒW??V1¿è‹]?¾Œ½t‘>¶šA!¨²> ‹À¢?ÀR?‹7—Cíè@B¨CfBñ%u@œm D9Ì@åJÊ@nä“=µ9q?ÉF^? ’X`À¼‹V?™©CÂC
CÑäRÂÚ#â¿ ^H
ýŠBìd•Âl|%ÃääAÿÿÿ ׂAyž@Bh¡÷<
>¨3«A„œÆBòC5ßC?í×BVÕªD:BÅBKljun et al
I found a solution that works by not using a text-mode file connection at all. I just read the lines I need as text from the binary connection with
file.bin <- file(filepath,"rb")
metadata <- readLines(file.bin,n=1)
close(file.bin)
where n is the number of lines to read, and then use
read.csv(text = metadata,...)
to read these lines as table.
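Put together, the workaround might look like the sketch below. It assumes five header lines, as in the dump above (one metadata line plus four column-description lines); `read_tob_header` is a hypothetical name, and the small temporary file only imitates the real layout.

```r
# Hypothetical helper: parse the text header through a binary-mode connection
# and report the byte offset at which the binary data starts.
read_tob_header <- function(path, n_header = 5L) {
  con <- file(path, "rb")
  on.exit(close(con))
  lines <- readLines(con, n = n_header)
  list(
    meta       = read.csv(text = lines[1], header = FALSE),
    columns    = read.csv(text = lines[-1], header = FALSE),
    data_start = seek(con)  # position right after the last header line
  )
}

# Stand-in file: five CRLF-terminated header lines, then one 4-byte integer (42).
header <- c('"TOB1","Tower"', '"SECONDS","TA"', '"SECONDS","deg C"',
            '"","Smp"', '"ULONG","IEEE4"')
tmp <- tempfile()
out <- file(tmp, "wb")
writeBin(charToRaw(paste0(header, "\r\n", collapse = "")), out)
writeBin(as.raw(c(0x2a, 0x00, 0x00, 0x00)), out)
close(out)

hdr <- read_tob_header(tmp)

# Reopen in binary mode and continue from the reported offset.
con <- file(tmp, "rb")
seek(con, hdr$data_start)
first_value <- readBin(con, "integer", size = 4L, endian = "little")
close(con)
```

Since the header is never touched through a text-mode connection, the offset comes from the same kind of connection that later reads the binary records.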
I am still curious about what causes the initial problem. As @KJ described, it may be related to line endings, which differ between Unix and Windows systems.