
dask.bag processing data out-of-memory


I'm trying to use dask.bag to do a word count over 30GB of JSON files, following the tutorial from the official site exactly: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html

It still doesn't work. My single machine has 32GB of memory and an 8-core CPU.

My code is below. Even with a 10GB file it doesn't work: it runs for a couple of hours without any output and then the Jupyter kernel crashes. I tried on both Ubuntu and Windows and hit the same problem on both systems. So I wonder: can dask.bag process data out of core at all, or is my code incorrect?

The test data is from http://files.pushshift.io/reddit/comments/

import dask.bag as db
import json
b = db.read_text('D:\RC_2015-01\RC_2012-04')
records = b.map(json.loads)
result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])
%time f = result.compute()
f

Solution

  • Try setting a blocksize in the 10MB range when reading from the single file to break it up a bit.

    In [1]: import dask.bag as db
    
    In [2]: b = db.read_text('RC_2012-04', blocksize=10000000)
    
    In [3]: %time b.count().compute()
    CPU times: user 1.22 s, sys: 56 ms, total: 1.27 s
    Wall time: 20.4 s
    Out[3]: 19044534
    

    Also, as a warning, you create a bag, records, but then don't do anything with it. You might want to remove that line (see the sketch below).
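
    Putting both points together, a minimal sketch of the full word-count pipeline might look like this (the blocksize and the pipeline steps are the ones from above; the filename assumes the uncompressed RC_2012-04 file sits in the working directory):

        import dask.bag as db

        # read the single large file in ~10 MB chunks so each partition fits in memory
        b = db.read_text('RC_2012-04', blocksize=10000000)

        # split each line into words, flatten, count word frequencies, take the top 10
        # (newer dask versions also expose .flatten() as a name for .concat())
        result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])

        top10 = result.compute()
        print(top10)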