Search code examples
rdata-cleaning

Find out row number when two conditions met


this question is related to R.

I have two data sets. Let say data sets A contains the following: Dataset A:

Date      Market_Cap
2017-1-1  10
2017-1-2  30
2017-1-1  50
2017-1-5  100
2017-1-5  200

Another B contains the following:

Date      Thr_Market_Cap
2017-1-1  30
2017-1-2  20
2017-1-5  110

I then want to compare dataset A and dataset B. The criteria is when the Date is the same and threshold market cap in dataset B is greater than the market cap in dataset A. Then I want to delete the row of dataset A of that entry when these two criteria satisfied.

The result after querying in dataset A is:

Data      Market_Cap
2017-1-2  30
2017-1-5  200
2017-1-1  50

My dataset A contains 43,261,925 rows and dataset B contains 500 rows.

Please take a look at my code

A variable is dataset A and B variable is dataset B. Both are data frame.

A_row=dim(A)[1]

B_row=dim(B)[1]

cores <- parallel::detectCores()
cl<-makeSOCKcluster(cores) #change the  to your number of CPU cores
registerDoSNOW(cl)

pb <- txtProgressBar(min=1, max=A, style=3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress=progress)

DEL <- foreach (i = 1:A_row, .options.snow=opts, 
.combine='rbind') %dopar% {
    for (j in 1:B_row){
        if (A$Date[i] == B$Date[j]){
            if(isTRUE(A$Market_Cap[i] < B$Thr_Market_Cap[j])){
                return(i)
            }   
        }
    }
}
close(pb)

DEL variable then contains list of all the rows number that satisfies the two criteria and then I can use it to delete the row in dataset A

Adj_A= A[,-c(DEL)]

I tried writing this code with parfor but it does not work, DEL always return NULL. If I write it in basic non-parallel computation for loop, it works flawlessly. But it takes ages due to the large file size...

Can some one comment on this code and I also want to know if there is any other way which uses build-in R function or dplyr to clean this data?

Much appreciated!


Solution

  • Join B to A, then filter. With dplyr:

    left_join(A, B, by = "Date") %>% filter(Thr_Market_Cap <= Market_Cap)
    

    If you want add %>% select(-Thr_Market_Cap) to get rid of the extra column.

    You're data is quite large, if you use data.table instead this will probably be faster:

    library(data.table)
    setDT(A, key = "Date")
    setDT(B, key = "Date")
    
    A = B[A, on = "Date"][Thr_Market_Cap <= Market_Cap, ]