Search code examples
rout-of-memoryaggregate-functionscalculated-columnslarge-data

Create column in R in a large database


My apologies if this question has already been answered, but I haven't found it. I'll post all my ideas to solve it. The problem is that the database is large and my PC cannot perform this calculation (core i7 and 8 GB RAM). I'm using Microsoft R Open 3.3.2 and RStudio 1.0.136.

I've trying to create a new column on a large database in R called tcm.RData (471 MB). My need is a column which divides Shape_Area by the sum of Shape_Area by COD (which I called ShapeSum). I first tried to do it in a single formula but, as it failed, I tried again in two steps with 1) summing up Shape_Area by COD and, if that succeed, to divide Shape_Area by ShapeSum.

> str(tcm)
    Classes ‘data.table’ and 'data.frame':  26835293 obs. of  15 variables:
    $ OBJECTID  : int  1 2 3 4 5 6 7 8 9 10 ...
    $ LAT       : num  -15.7 -15.7 -15.7 -15.7 -15.7 ...
    $ LONG      : num  -58.1 -58.1 -58.1 -58.1 -58.1 ...
    $ UF        : chr  "MT" "MT" "MT" "MT" ...
    $ COD       : num  510562 510562 510562 510562 510562 ...
    $ AREA_97   : num  1130 1130 1130 1130 1130 ...
    $ Shape_Area: num  255266.7 14875 25182.2 5503.9 95.5 ...
    $ TYPE      : chr  "2" "2" "2" "2" ...
    $ Nomes     : chr  NA NA NA NA ...
    $ NEAR_DIST : num  376104 371332 371410 371592 371330 ...
    $ tc_2004   : chr  "AREA_URBANA" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" ...
    $ tc_2008   : chr  "AREA_URBANA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" ...
    $ tc_2010   : chr  "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_LIMPO" ...
    $ tc_2012   : chr  "AREA_URBANA" "PASTO_SUJO" "PASTO_SUJO" "PASTO_SUJO" ...
    $ tc_2014   : chr  "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_SUJO" ...
    - attr(*, ".internal.selfref")=<externalptr> 

> tcm$ShapeSum <- tcm[, Shape_Area := sum(tcm$Shape_Area), by="COD"]
     Error: cannot allocate vector of size 204.7 Mb
     Error during wrapup: cannot allocate vector of size 542.3 Mb

I also tried the following codes, but all of them failed:

> tcm$ShapeSum <- apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by="COD")

Error in apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by = "COD") : dim(X) must have a positive lenght

> tcm$ShapeSum <- mutate(tcm, ShapeSum = sum(Shape_Area), by="COD", package = "dplyr")

Error: cannot allocate vector of size 204.7 Mb Error during wrapup: cannot allocate vector of size 542.3 Mb

> tcm$ShapeSum <- tcm[, transform(tcm, ShapeSum = sum(Shape_Area)), by="COD"]

> tcm$ShapeSum <- transform(tcm, aggregate(tcm$AreaShape, by=list(Category=tcm$COD), FUN=sum))

Error in aggregate.data.frame(as.data.frame(x), ...): no rows to aggregate

I thank very much for attention and for any suggestions to solve this problem.


Solution

  • We can use the data.table methods for creating the column as it is more efficient with the assignment (:=) which happens in place

    library(data.table)
    tcm[, ShapeSum := sum(Shape_Area), by = COD]
    

    Or as @user20650 suggested it could be (based on the OP's description)

    tcm[, ShapeSum := Shape_Area/sum(Shape_Area), by = COD]