
How to load very large csv files in nodejs?


I'm trying to load two big CSV files into Node.js: the first is 257,597 KB and the second 104,330 KB. I'm using the filesystem (fs) and csv modules; here's my code:

const fs = require('fs')
const csv = require('csv')

let myData

// Read the whole file into memory, then parse it in one go
fs.readFile('path/to/my/file.csv', (err, data) => {
  if (err) console.error(err)
  else {
    csv.parse(data, (err, dataParsed) => {
      if (err) console.error(err)
      else {
        myData = dataParsed
        console.log('csv loaded')
      }
    })
  }
})

After a very long time (1-2 hours) it crashes with this error message:

<--- Last few GCs --->

[1472:0000000000466170]  4366473 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.
3) MB, 5584.4 / 0.0 ms  last resort GC in old space requested
[1472:0000000000466170]  4371668 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.
3) MB, 5194.3 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 000002BDF12254D9 <JSObject>
    1: stringSlice(aka stringSlice) [buffer.js:590] [bytecode=000000810336DC91 o
ffset=94](this=000003512FC822D1 <undefined>,buf=0000007C81D768B9 <Uint8Array map
 = 00000352A16C4D01>,encoding=000002BDF1235F21 <String[4]: utf8>,start=0,end=263
778854)
    2: toString [buffer.js:664] [bytecode=000000810336D8D9 offset=148](this=0000
007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memo
ry
 1: node::DecodeWrite
 2: node_module_register
 3: v8::internal::FatalProcessOutOfMemory
 4: v8::internal::FatalProcessOutOfMemory
 5: v8::internal::Factory::NewRawTwoByteString
 6: v8::internal::Factory::NewStringFromUtf8
 7: v8::String::NewFromUtf8
 8: std::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame
> >::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >
 9: v8::internal::wasm::SignatureMap::Find
10: v8::internal::Builtins::CallableFor
11: v8::internal::Builtins::CallableFor
12: v8::internal::Builtins::CallableFor
13: 00000081634043C1

The biggest file loads, but Node runs out of memory on the other one. Allocating more memory is probably easy, but the main issue here is the loading time, which seems very long given the file sizes. So what is the correct way to do it? For comparison, Python loads these CSVs really fast with pandas (3-5 seconds).
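(For reference only: the V8 heap limit can be raised with Node's --max-old-space-size flag, which takes a size in megabytes, but that only postpones the crash and does nothing for the loading time:

    node --max-old-space-size=8192 script.js
)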


Solution

  • Streams work perfectly; loading took only 3-5 seconds:

    var fs = require('fs')
    var csv = require('csv-parser')
    var data = []

    // Stream the file and parse it row by row instead of loading it all at once
    fs.createReadStream('path/to/my/data.csv')
      .pipe(csv())
      .on('data', function (row) {
        data.push(row)
      })
      .on('end', function () {
        console.log('Data loaded')
      })