
BaseX - XQuery - Out of memory when writing results to CSV file


I am trying to write an XQuery result to a CSV file, see the code below (the result contains at least 1.6 million lines, and will probably become a lot more). However, several minutes into execution the program fails with an 'out of main memory' error. I am using a laptop with 4 GB of memory. I would have thought that writing to a file would prevent memory bottlenecks. Also, I am already using the copynode-false pragma.

I might have gone about the code the wrong way, since this is my first XQuery/BaseX program. Or this might not be solvable without extra hardware (current database SIZE: 3092 MB; NODES: 142477344). Any assistance would be much appreciated!

let $params := 
<output:serialization-parameters xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization">
  <output:method value="csv"/>
  <output:csv value="header=yes, separator=semicolon"/>
</output:serialization-parameters>

return file:write(
  '/tmp/output.csv', 
  (# db:copynode false #){<csv>{
    for $stand in //*:stand       
    return <record>{$stand//*:kenmerk}</record>
      (: {$stand//*:identificatieVanVerblijfsobject}
      {$stand//*:inOnderzoek}
      {$stand//*:documentdatum}
      {$stand//*:documentnummer} :)
  }</csv>},
  $params
)

Solution

  • It’s a good idea to use the copynode pragma to save memory. In your case, however, it’s probably the total number of newly created element nodes that simply consumes too much memory before the data can be written to disk.

    If you have large data sets, the xquery serialization format may be the better choice. Maps and arrays consume less memory than XML nodes:

    let $params := map {
      'format': 'xquery',
      'header': true(),
      'separator': 'semicolon'
    }
    let $data := map {
      'names': [
        'kenmerk', 'inOnderzoek'
      ],
      'records': (
        for $stand in //*:stand
        return [
          string($stand//*:kenmerk),
          string($stand//*:inOnderzoek)
        ]
      )
    }
    return file:write-text(
      '/tmp/output.csv',
      csv:serialize($data, $params)
    )
    

    Another approach is to use the window clause and write the results in chunks:

    for tumbling window $stands in //*:stand
      start at $s when true()
      end   at $e when $e - $s eq 100000
    let $first := $s = 1
    
    let $path := '/tmp/output.csv'
    let $csv := <csv>{
      for $stand in $stands
      return <record>{
        $stand//*:kenmerk,
        $stand//*:inOnderzoek
      }</record>
    }</csv>
    let $params := map {
      'method': 'csv',
      'csv': map {
        'separator': 'semicolon',
        'header': $first
      }
    }
    return if ($first) then (
      file:write($path, $csv, $params)
    ) else (
      file:append($path, $csv, $params)
    )
    

    After the first write operation, subsequent table rows are appended to the original file. The chunk size (here: roughly 100000 rows per window) can be freely adjusted. As in your original code, the serialization parameters can also be specified as XML; and it’s of course also possible to use the xquery serialization format in the second example.
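
    For completeness, a sketch of that last variant: the chunked write from the second example, but with the serialization parameters given as an XML element, as in your original query. This is untested and reuses the same assumed element names and output path as above; note that the header parameter is toggled per chunk via an attribute value template:

    for tumbling window $stands in //*:stand
      start at $s when true()
      end   at $e when $e - $s eq 100000
    let $first := $s = 1
    let $path := '/tmp/output.csv'
    (: serialization parameters as XML instead of a map;
       header is only emitted for the first chunk :)
    let $params :=
      <output:serialization-parameters
          xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization">
        <output:method value="csv"/>
        <output:csv value="separator=semicolon, header={
          if ($first) then 'yes' else 'no'
        }"/>
      </output:serialization-parameters>
    let $csv := <csv>{
      for $stand in $stands
      return <record>{
        $stand//*:kenmerk,
        $stand//*:inOnderzoek
      }</record>
    }</csv>
    return if ($first) then (
      file:write($path, $csv, $params)
    ) else (
      file:append($path, $csv, $params)
    )

    The behavior is the same as with the map-based parameters; which form to use is purely a matter of taste.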