Search code examples
rtimestampposixdate-formattingparticles

Functional programming for date-formatting air quality data


Environmental science researcher who is new to programming, go easy on me!

I have two air quality instruments (a dusttrak and a ptrak) that are logging data and storing them as .csv files. My goal is to automate the data cleaning process via functional programming. Each instrument records in different time intervals (30 seconds vs 1 second), and each instrument has a unique header.

I already have a function that reads the ptrak data. It deletes the obnoxious header and converts the date and time columns into one as.POSIX datetime. The result is a new wide format dataframe with only two columns, datetime and particle number concentration (pnc).

Here's the ptrak function:

## assume there is only one file per directory for now
read.ptrak<-function(fpath){
    x<-read.csv(fpath,skip=30,header=FALSE,stringsAsFactors=FALSE) #removing the first 30 rows of garbage
    colnames(x) <- c("date","time","pnc") #creating my own header
    ##merge date and time column together
    x$datetime<-strptime(paste(x$date,x$time), "%m/%d/%Y %H:%M:%S", tz="UTC")
    ## convert the first column to a posix timestamp
    x$datetime<-as.POSIXct(x$datetime,format=dt_format, tz="UTC")
    x<-x[,-c(1:2)] ## remove redundant variables date, and time
    x<-x[,c(2:1)] ## reorder columns so datetime is first
    return(x)
}

#okay now we can apply our function to our ptrak csv file:
ptrak_data <- read.ptrak(**INSERT FILE PATH HERE**)
head(ptrak_data)
#everything looks great!

Where I'm running into trouble is with the dusttrak data. Instead of having a date and time column for each observation, I am only provided with a starting time that is located within the header. The actual dataframe only provides the total elapsed time in 30 second intervals from this start time. I want to create a new dataframe that has a POSIX timestamp and five particle mass concentrations (see below) that I can later merge by datetime with the ptrak. Can anyone provide a function that uses the starting time and elapsed time to create a new datetime vector and then remove the header so I am left with a wide format dataframe with two columns?

Here's my first attempt at cleaning the dusttrak data:

read.dtrak<-function(fpath){
    x<-read.csv(fpath,skip=36,header=FALSE,stringsAsFactors=FALSE)
    colnames(x)<-c("elapsedtime","pm1","pm2.5","pm4","pm10","total","alarms","errors")
    ## need to read in the same file again and keep the header to extract the start time and start date:
    y<-read.csv(fpath,skip=6,header=FALSE,stringsAsFactors=FALSE)
    colnames(y)<-c("variable","value") ## somewhat arbitrary colnames for temporary df
    starttime <-y[1,2]
    startdate <-y[2,2]
    startdatetime <- strptime(paste(startdate,starttime), "%m/%d/%Y %H:%M:%S", tz="UTC")
    #convert to posix timestamp:
    startdatetime <-as.POSIXct(startdatetime, format=dt_format, tz="UTC")
    ## create a new variable called datetime in dataframe 'x'
    x$datetime <- startdatetime + x$elapsedtime  ## this is giving me the following error: "Error in unclass(e1) + unclass(e2) : non-numeric argument to binary operator
    return(x)
}

The end goal is to produce a cleaned dataframe that is similar to the ptrak data, except instead of reporting one particle number concentration (pnc), there needs to be PM1, PM2.5, PM4, PM10, and TOTAL (see dusttrak_data.csv).

Apologies in advance for not including sample data within the post. I could not figure out how to create sample data that included those pesky headers!

Finding an answer to this question would essentially save me +100 hours of manual data cleaning work so I greatly appreciate your insight!

Here's the data: Ptrak, Dusttrak EDIT: conversion of Dave2e's solution into a function for those who are interested.

read.dtrak<-function(fpath){
    sdate<-read.csv(fpath, header=FALSE, nrow=1, skip =7)
    stime <-read.csv(fpath, header = FALSE, nrow=1, skip=8)  
    startDate<-as.POSIXct(paste(sdate$V2, stime$V2), "%m/%d/%Y %H:%M:%S", tz="UTC")
    x<-read.csv(fpath, skip=36, stringsAsFactors = FALSE)
    names(x)<-c("elapsedtime","pm1","pm2.5","pm4","pm10","total","alarms","errors")
    x$elapsedtime<-x$elapsedtime+startDate
    x<-x[,-c(7,8)] #remove the alarms and errors variables
    names(x$elapsedtime)<-"datetime" #rename timestamp to datetime
    return(x)
}

read.dtrak("**INSERT FILE PATH HERE**")

Solution

  • This is a pretty straightforward problem, assuming each file has a constant number of lines in the header. A POSIXct object is a the number of seconds since the beginning. Since your data is in seconds it is just a matter of adding the elapsed time to the start time.

    I read the two line with the start date and time. Pasted the values together and converted to a datetime object and then read in the remaining data. Added the elapsed time to the start time and you are good to go.

    #pratice<-readLines("dusttrak_data.csv")
    #get start time and date then convert to POSIXct object
    stime<-read.csv("dusttrak_data.csv", header = FALSE, nrow=1, skip=6)
    sdate<-read.csv("dusttrak_data.csv", header = FALSE, nrow=1, skip=7)
    
    #read data, and add elasped time to start time
    startDate<-as.POSIXct(paste(sdate$V2, stime$V2), "%m/%d/%Y %I:%M:%S %p", tz="EST")
    df<-sdate<-read.csv("dusttrak_data.csv",   skip=36)
    names(df)<-c("elapsedtime", "PM1", "PM2.5", "PM4", "PM10", "TOTAL", "Alarms", "Errors")
    df$elapsedtime<-df$elapsedtime+startDate
    #removed columns 7 and 8
    df<-df[,-c(7:8)]
    

    You will need to adjust the timezone in the as.POSIXct function to match the sensors time.