Search code examples
rdataframemissing-datafunction-approximation

How can I prepare the data that some values are very different to group?


Some values are so different from group because there are missing rows and the data is not continuous make my diffVal unusual.

> df
                  Date diffVal1 diffVal2
1  2017-05-31 04:01:00      718      483
2  2017-05-31 05:01:00      704      477
3  2017-05-31 06:01:00      741      478
4  2017-05-31 07:01:00      874      483
5  2017-05-31 08:01:00      907      495
6  2017-05-31 09:01:00      887      510
7  2017-05-31 10:01:00     2922      514
8  2017-05-31 13:01:00     1012      529
9  2017-05-31 14:01:00      979      539
10 2017-05-31 15:01:00      886      485
11 2017-05-31 16:01:00      818      471

You can see that there are missing rows at Date (hour;11,12) I need to smooth the unusual value to normal.

I'm trying to set unusual value to NULL but the problem is how to know there are unusual value in data frame from big data frame if it is my example data frame I can set the data that is more then 1200 to NA (It is not good idea at all because it is not reasonable.) then use approximated NA values function na.approx() and I have to get plot of these later.

df$diffVal1 <- ifelse((df$diffVal1>1300), NA,df$diffVal1)
df$diffVal1 <- na.approx(df$diffVal1)
> df 
                  Date diffVal1 diffVal2
1  2017-05-31 04:01:00    718.0      483
2  2017-05-31 05:01:00    704.0      477
3  2017-05-31 06:01:00    741.0      478
4  2017-05-31 07:01:00    874.0      483
5  2017-05-31 08:01:00    907.0      495
6  2017-05-31 09:01:00    887.0      510
7  2017-05-31 10:01:00    949.5      514
8  2017-05-31 13:01:00   1012.0      529
9  2017-05-31 14:01:00    979.0      539
10 2017-05-31 15:01:00    886.0      485
11 2017-05-31 16:01:00    818.0      471

What should do with this problem? And how to add missing rows by Date to approximate from adding again?

Thank you so much for your helps.


Solution

  • Tell me if that works for you:

    data preparation :

    df <- read.table(text="Date; diffVal1; diffVal2
    1;  2017-05-31 04:01:00;      718;      483
    2;  2017-05-31 05:01:00;      704;      477
    3;  2017-05-31 06:01:00;      741;      478
    4;  2017-05-31 07:01:00;      874;      483
    5;  2017-05-31 08:01:00;      907;      495
    6;  2017-05-31 09:01:00;      887;      510
    7;  2017-05-31 10:01:00;     2922;      514
    8;  2017-05-31 13:01:00;     1012;      529
    9;  2017-05-31 14:01:00;      979;      539
    10; 2017-05-31 15:01:00;      886;      485
    11; 2017-05-31 16:01:00;      818;      471",sep=";",header=TRUE,stringsAsFactors=FALSE)
    
    df$Date     <- as.POSIXct(df$Date)
    df$diffVal1 <- as.numeric(df$diffVal1)
    df$diffVal2 <- as.numeric(df$diffVal2)
    all_dates <- data.frame(Date = seq(min(df$Date),max(df$Date),by=3600))
    

    work and result :

    df2 <- df
    df2 <- df2[order(df2$Date,decreasing=TRUE),]
    df2$Val1_total <-  cumsum(df2$diffVal1)
    df2 <- merge(df2,all_dates,all.y = TRUE)
    
    df2$Val1_total[is.na(df2$Val1_total)] <- approx(x = df2$Date, y = df2$Val1_total, xout = df2$Date[is.na(df2$Val1_total)])$y
    df2$diffVal1 <- c(-diff(df2$Val1_total),tail(df2$diffVal1,1))
    
    # > df2
    # Date diffVal1 diffVal2 Val1_total
    # 1  2017-05-31 04:01:00      718      483      11448
    # 2  2017-05-31 05:01:00      704      477      10730
    # 3  2017-05-31 06:01:00      741      478      10026
    # 4  2017-05-31 07:01:00      874      483       9285
    # 5  2017-05-31 08:01:00      907      495       8411
    # 6  2017-05-31 09:01:00      887      510       7504
    # 7  2017-05-31 10:01:00      974      514       6617
    # 8  2017-05-31 11:01:00      974       NA       5643
    # 9  2017-05-31 12:01:00      974       NA       4669
    # 10 2017-05-31 13:01:00     1012      529       3695
    # 11 2017-05-31 14:01:00      979      539       2683
    # 12 2017-05-31 15:01:00      886      485       1704
    # 13 2017-05-31 16:01:00      818      471        818