
dask.bag processing data out-of-memory


I'm trying to use dask.bag to do a word count over 30GB of JSON files, following the tutorial from the official site exactly: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html

It still doesn't work. My single machine has 32GB of memory and an 8-core CPU.

My code is below. Even with a 10GB file it doesn't work: it runs for a couple of hours without any output and then the Jupyter kernel crashes. I tried on both Ubuntu and Windows and hit the same problem on both systems. So I wonder: can dask.bag process data out of core at all, or is my code incorrect?

The test data is from http://files.pushshift.io/reddit/comments/

import dask.bag as db
import json
b = db.read_text('D:\RC_2015-01\RC_2012-04')
records = b.map(json.loads)
result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])
%time f = result.compute()
f

Solution

  • Try setting a blocksize in the 10MB range when reading from the single file to break it up a bit.

    In [1]: import dask.bag as db
    
    In [2]: b = db.read_text('RC_2012-04', blocksize=10000000)
    
    In [3]: %time b.count().compute()
    CPU times: user 1.22 s, sys: 56 ms, total: 1.27 s
    Wall time: 20.4 s
    Out[3]: 19044534
    

    Also, as a warning, you create a bag, records, but then don't do anything with it. You might want to remove that line (see the sketch below).
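
    Putting both points together, a minimal sketch of the full word-count pipeline might look like this (the blocksize and the pipeline steps are the ones from above; the filename assumes the uncompressed RC_2012-04 file sits in the working directory):

        import dask.bag as db

        # read the single large file in ~10 MB chunks so each partition fits in memory
        b = db.read_text('RC_2012-04', blocksize=10000000)

        # split each line into words, flatten, count word frequencies, take the top 10
        # (newer dask versions also expose .flatten() as a name for .concat())
        result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])

        top10 = result.compute()
        print(top10)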