
Finding the percentage of missing values in each column in parallel using the bigmemory and parallel packages in R


Here's what I want to do:

> library(parallel)
> library(bigmemory)
> big.mat=read.big.matrix("cp2006.csv",header=T)
Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("cp2006.csv", header = T) :
  Because type was not specified, we chose double based on the first line of data.
> jobs <- lapply(1:10, function(x) mcparallel(colMeans(is.na(big.mat))*100, name = big.mat))
Error in as.character.default(name) : 
  no method for coercing this S4 class to a vector
> res  <- mccollect(jobs)

However, is.na apparently cannot be applied to big.matrix objects. I searched the web and found mwhich, bigmemory's expanded version of which, but I couldn't find a good tutorial on using it to locate the missing (NA) values in each column. So I am not sure what function I should pass to mcparallel to make it work with big.matrix objects. Calling colMeans directly fails as well:

> col.NA.mean<-colMeans(is.na(big.mat))*100
Error in colMeans(is.na(big.mat)) : 
  'x' must be an array of at least two dimensions
In addition: Warning message:
In is.na(big.mat) : is.na() applied to non-(list or vector) of type 'S4'

Solution

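  • I found the answer. The big.matrix has to be indexed with [,], which returns its contents as an ordinary R matrix that is.na and colMeans can work on. Here is a partial answer (the values are proportions, not yet percentages):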

    > colMeans(is.na(big.mat[,]))
                 Year             Month        DayofMonth         DayOfWeek 
           0.00000000        0.00000000        0.00000000        0.00000000 
              DepTime        CRSDepTime           ArrTime        CRSArrTime 
           0.02102102        0.00000000        0.02402402        0.00000000 
        UniqueCarrier         FlightNum           TailNum ActualElapsedTime 
           1.00000000        0.00000000        0.97997998        0.02402402 
       CRSElapsedTime           AirTime          ArrDelay          DepDelay 
           0.00000000        0.02402402        0.02402402        0.02102102 
               Origin              Dest          Distance            TaxiIn 
           1.00000000        1.00000000        0.00000000        0.00000000 
              TaxiOut         Cancelled  CancellationCode          Diverted 
           0.00000000        0.00000000        1.00000000        0.00000000 
         CarrierDelay      WeatherDelay          NASDelay     SecurityDelay 
           0.00000000        0.00000000        0.00000000        0.00000000 
    LateAircraftDelay 
           0.00000000 
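
    Indexing with [,] copies the entire matrix into RAM, which may be a problem for data that only just fits in memory. A minimal sketch of a lower-memory variant reads one column at a time instead. It assumes a small made-up matrix m in place of cp2006.csv, and the names m, x and na_pct are illustrative only:

    ```r
    library(bigmemory)

    # Made-up stand-in for the real data (values are listed column by column).
    m <- matrix(c( 1,  2,  3,  4,  5,    # column a: no NA
                   1, NA,  3, NA,  5,    # column b: 2 of 5 missing
                  NA, NA, NA, NA, NA),   # column c: all missing
                nrow = 5, dimnames = list(NULL, c("a", "b", "c")))
    x <- as.big.matrix(m)

    # Extract a single column per iteration, so only nrow(x) values
    # are held as an ordinary R vector at any one time.
    na_pct <- vapply(seq_len(ncol(x)),
                     function(j) mean(is.na(x[, j])) * 100,
                     numeric(1))
    names(na_pct) <- colnames(x)
    na_pct
    ```

    For this toy matrix the percentages are 0, 40 and 100 for columns a, b and c.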
    

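    Here's the full answer. Passing the big.matrix itself as name fails because it cannot be coerced to a character string; big.mat[,] can be, and its first value (2006) becomes the name of every job: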

    > library(parallel)
    > library(bigmemory)
    > big.mat=read.big.matrix("cp2006.csv",header=T)
    Warning messages:
    1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
    2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
    3: In read.big.matrix("cp2006.csv", header = T) :
      Because type was not specified, we chose double based on the first line of data.
    > jobs <- lapply(1:10, function(x) mcparallel(colMeans(is.na(big.mat[,]))*100, name = big.mat))
    Error in as.character.default(name) : 
      no method for coercing this S4 class to a vector
    > jobs <- lapply(1:10, function(x) mcparallel(colMeans(is.na(big.mat[,]))*100, name = big.mat[,]))
    > res  <- mccollect(jobs)
    > res
    $`2006`
                 Year             Month        DayofMonth         DayOfWeek 
             0.000000          0.000000          0.000000          0.000000 
              DepTime        CRSDepTime           ArrTime        CRSArrTime 
             2.102102          0.000000          2.402402          0.000000 
        UniqueCarrier         FlightNum           TailNum ActualElapsedTime 
           100.000000          0.000000         97.997998          2.402402 
       CRSElapsedTime           AirTime          ArrDelay          DepDelay 
             0.000000          2.402402          2.402402          2.102102 
               Origin              Dest          Distance            TaxiIn 
           100.000000        100.000000          0.000000          0.000000 
              TaxiOut         Cancelled  CancellationCode          Diverted 
             0.000000          0.000000        100.000000          0.000000 
         CarrierDelay      WeatherDelay          NASDelay     SecurityDelay 
             0.000000          0.000000          0.000000          0.000000 
    LateAircraftDelay 
             0.000000 
    
    # ... eight more elements omitted here, each also named `2006` and identical to the first ...
    
    $`2006`
                 Year             Month        DayofMonth         DayOfWeek 
             0.000000          0.000000          0.000000          0.000000 
              DepTime        CRSDepTime           ArrTime        CRSArrTime 
             2.102102          0.000000          2.402402          0.000000 
        UniqueCarrier         FlightNum           TailNum ActualElapsedTime 
           100.000000          0.000000         97.997998          2.402402 
       CRSElapsedTime           AirTime          ArrDelay          DepDelay 
             0.000000          2.402402          2.402402          2.102102 
               Origin              Dest          Distance            TaxiIn 
           100.000000        100.000000          0.000000          0.000000 
              TaxiOut         Cancelled  CancellationCode          Diverted 
             0.000000          0.000000        100.000000          0.000000 
         CarrierDelay      WeatherDelay          NASDelay     SecurityDelay 
             0.000000          0.000000          0.000000          0.000000 
    LateAircraftDelay 
             0.000000 
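
    The call above launches ten jobs that each compute the same column means, so the work is not actually divided among them. A minimal sketch of splitting the columns across forked jobs instead is given here. It assumes a Unix-alike system (mcparallel relies on forking), a small made-up matrix in place of cp2006.csv, and hypothetical job names job1 and job2:

    ```r
    library(parallel)
    library(bigmemory)

    # Made-up stand-in for the real data (values are listed column by column).
    m <- matrix(c( 1, NA,  3,  4,    # column a: 1 of 4 missing
                   5,  6,  7,  8,    # column b: no NA
                  NA, NA, 11, 12,    # column c: 2 of 4 missing
                  NA, NA, NA, NA),   # column d: all missing
                nrow = 4, dimnames = list(NULL, c("a", "b", "c", "d")))
    x <- as.big.matrix(m)

    # Divide the column indices into contiguous groups, one per job.
    n_jobs <- 2
    col_groups <- splitIndices(ncol(x), n_jobs)

    # Fork one job per group; each job extracts only its own columns.
    # name must be a character string, so build one from the job index.
    jobs <- lapply(seq_along(col_groups), function(k) {
      cols <- col_groups[[k]]
      mcparallel(colMeans(is.na(x[, cols, drop = FALSE])) * 100,
                 name = paste0("job", k))
    })

    # Collect the per-job vectors and join them in column order.
    res <- mccollect(jobs)
    na_pct <- unlist(unname(res))
    na_pct
    ```

    Each job here handles a different slice of columns, and the list returned by mccollect is labelled job1 and job2 rather than 2006. For this toy matrix the percentages are 25, 0, 50 and 100 for columns a to d.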