Search code examples
rxts

Clean column from duplicates being in blocks


Question: How can I clean data from "duplicates" within blocks. I use the term [blocks] to illustrate that 2 values (in same column) are equal and are positioned either above of below.

In column [c1] I have the values [2] and [3].

Value [2] should never have value [2] under.

Value [3] should never have value [3] under.

I cannot use a standard duplication removal function, because there will be duplicates in the column. It is not possible to delete the rows manual since they will be in amount of thousands.

If possible it would be good to solve without loading any R packages.

My R-file:

##########
# Test xts
##########
dates <- as.POSIXct(c
                    (
                      "2013-07-24 09:01:00", 
                      "2013-07-24 09:02:00", 
                      "2013-07-24 09:03:00",
                      "2013-07-24 09:04:00",
                      "2013-07-24 09:05:00",
                      "2013-07-24 09:06:00",
                      "2013-07-24 09:07:00"
                      )
                    )
c1 <- c(2,3,2,2,3,3,2)              # Data in c1.
# c2 <- c(0,3,2,2,3,0,2)            # Data in c2.
data <- data.frame(c1)              # Create a dataframe.
xts9 <- xts(x=data, order.by=dates) # Create xts based on dataframe.

The result of running the R-file:

                    c1
2013-07-24 09:01:00  2
2013-07-24 09:02:00  3
2013-07-24 09:03:00  2
2013-07-24 09:04:00  2
2013-07-24 09:05:00  3
2013-07-24 09:06:00  3
2013-07-24 09:07:00  2

Comments of which lines should be deleted:

                    c1
2013-07-24 09:01:00  2
2013-07-24 09:02:00  3
2013-07-24 09:03:00  2
2013-07-24 09:04:00  2 # To be remove due to having a 2 above.
2013-07-24 09:05:00  3
2013-07-24 09:06:00  3 # To be remove due to having a 2 above.
2013-07-24 09:07:00  2

Solution

  • We can use rleid function from data.table and then use duplicated to remove the repeating rows.

    library(data.table)
    xts9[!duplicated(rleid(xts9)), ]
    
    #                    c1
    #2013-07-24 09:01:00  2
    #2013-07-24 09:02:00  3
    #2013-07-24 09:03:00  2
    #2013-07-24 09:05:00  3
    #2013-07-24 09:07:00  2
    

    If you want to do this in base R, we can use rle instead using the same logic

    x <- rle(rowSums(xts9))
    xts9[!duplicated(rep(seq_along(x$values), x$lengths)), ]
    
    #                    c1
    #2013-07-24 09:01:00  2
    #2013-07-24 09:02:00  3
    #2013-07-24 09:03:00  2
    #2013-07-24 09:05:00  3
    #2013-07-24 09:07:00  2