
Using the dcast function (reshape2) on a large dataset


I have a dataframe that has dimensions of (325,928 x 2).

Below is a very small subset of that data:

Destination = c('A60001', 'A60001','A60001','A60001','A60001','A60001','A60001','A60001',
            'A60001','A60001','A60001','A60001','A60001','A60001','A60001','A60001',
            'A60001','A60001','A60001','A60001','A60001','A60001','A60001','A60001',
            'A60001', 'A60002', 'A60002','A60002','A60002','A60003')
Source = c('AA53', 'AA582', 'AA18', 'AA388', 'AA841', 'AA72', 'AA19', 'AA77', 'AA78', 'AA20', 'AA21',
       'AA12', 'AA412', 'AA634', 'AA591', 'AA859', 'AA157', 'AA254', 'AA167', 'AA176',
       'AA428', 'AA538', 'AA268', 'AA196', 'AA1250', 'AA23', 'AA16', 'AA692', 'AA196',
       'AA22')

df = data.frame(Destination, Source)

> df
   Destination Source
1       A60001   AA53
2       A60001  AA582
3       A60001   AA18
4       A60001  AA388
5       A60001  AA841
6       A60001   AA72
7       A60001   AA19
8       A60001   AA77
9       A60001   AA78
10      A60001   AA20
11      A60001   AA21
12      A60001   AA12
13      A60001  AA412
14      A60001  AA634
15      A60001  AA591
16      A60001  AA859
17      A60001  AA157
18      A60001  AA254
19      A60001  AA167
20      A60001  AA176
21      A60001  AA428
22      A60001  AA538
23      A60001  AA268
24      A60001  AA196
25      A60001 AA1250
26      A60002   AA23
27      A60002   AA16
28      A60002  AA692
29      A60002  AA196
30      A60003   AA22

My ultimate goal is to reshape this dataframe into a wide table with one row per Source and one column per Destination. I need something similar to dcast, because dcast from reshape2 cannot handle this much data.

So here was the original code that I tried with this dataframe:

library(reshape2)
test <- dcast(cbind(df, V1 = rep(1, nrow(df))), Source ~ Destination, value.var = 'V1', fun.aggregate = length)

Output:

   Source A60001 A60002 A60003
1    AA12      1      0      0
2  AA1250      1      0      0
3   AA157      1      0      0
4    AA16      0      1      0
5   AA167      1      0      0
6   AA176      1      0      0
7    AA18      1      0      0
8    AA19      1      0      0
9   AA196      1      1      0
10   AA20      1      0      0
11   AA21      1      0      0
12   AA22      0      0      1
13   AA23      0      1      0
14  AA254      1      0      0
15  AA268      1      0      0
16  AA388      1      0      0
17  AA412      1      0      0
18  AA428      1      0      0
19   AA53      1      0      0
20  AA538      1      0      0
21  AA582      1      0      0
22  AA591      1      0      0
23  AA634      1      0      0
24  AA692      0      1      0
25   AA72      1      0      0
26   AA77      1      0      0
27   AA78      1      0      0
28  AA841      1      0      0
29  AA859      1      0      0

It works with the sample I am providing, but when I run it on the full 325,928 x 2 dataset, R crashes. Is there a better function that produces the same output but can handle larger amounts of data? If this isn't enough information, I can provide the full dataset privately to anyone who thinks they can solve this (I can't post it here because it is too large for StackOverflow), so you can test the issue directly from the source.

Any help would be great, thanks!


Solution

  • Thanks to @Imo's suggestion, here is a solution that works:

    If your dataset is very large/wide, convert your dataframe to a data.table and then use data.table's dcast on it:

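    library(data.table)
    df1 <- setDT(df)   # converts df to a data.table by reference
    df1$value <- 1     # helper column to spread into the wide table
    trial <- dcast(df1, Source ~ Destination, value.var = "value", fill = 0)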
    

    This gives the same result and can handle large amounts of data.
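
As a minimal sketch of the same idea, the call below spells out the aggregation explicitly, so that repeated Source/Destination pairs are counted rather than left to dcast's default behavior. The sample values and the names `dt` and `wide` are made up for illustration, and this has not been run against the full 325,928-row dataset.

```r
library(data.table)

# Small made-up sample in the same shape as the question's data
dt <- data.table(
  Destination = c("A60001", "A60001", "A60002", "A60002", "A60003"),
  Source      = c("AA53",   "AA196",  "AA196",  "AA23",   "AA22")
)

# Helper column whose entries are counted for each Source/Destination pair
dt[, value := 1L]

# One row per Source, one column per Destination;
# pairs that never occur are filled with 0
wide <- dcast(dt, Source ~ Destination,
              value.var = "value", fun.aggregate = length, fill = 0L)
print(wide)
```

With `fun.aggregate = length`, each cell holds the number of rows for that pair, which matches the 0/1 output in the question whenever every pair occurs at most once.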