DotNetRDF: Graph or CompressingTurtleWriter does not release memory


I'm using the dotNetRDF framework and C# to export graphs in Turtle format for patients, creating one Turtle file per patient. After about 400 patients the program stalls due to memory issues. Each Turtle file is between 2 and 150 MB. According to Task Manager, the program occupies about 4 GB of memory after 100 patients and 19 GB after 500 patients.

I have a function in an export class that reads the data from an MSSQL server, builds the graph, and finally uses CompressingTurtleWriter to write the graph to a Turtle file.

private int ExportPatient(string SubjectPseudoId)    
{
    Graph exportGraph = new Graph();
    AddNamespaces(exportGraph);

    // for each type of predicate
    {
       // read data from SQL (SqlConnection, SqlCommand and reader are using the using(){} statement)
       // for each datareader
       {
          // save values in subjectvalue, predicatevalue, objectvalue strings

          switch (objecttype)
          {
             case "string":
                exportGraph.Assert(new Triple(exportGraph.CreateUriNode(prefixRessource + EncodeIRI(dataprovidervalue + "-" + semanticDefinition.ClassName + "-" + subjectvalue)),
                exportGraph.CreateUriNode(semanticDefinition.AttributePrefixId + ":" + semanticDefinition.AttributeName),
                exportGraph.CreateLiteralNode(objectvalue, new Uri(XmlSpecsHelper.XmlSchemaDataTypeString))));
                break;
             case "double":
                exportGraph.Assert(new Triple(exportGraph.CreateUriNode(prefixRessource + EncodeIRI(dataprovidervalue + "-" + semanticDefinition.ClassName + "-" + subjectvalue)),
                exportGraph.CreateUriNode(semanticDefinition.AttributePrefixId + ":" + semanticDefinition.AttributeName),
                exportGraph.CreateLiteralNode(objectvalue, new Uri(XmlSpecsHelper.XmlSchemaDataTypeDouble))));
                break;
             case "datetime":
                exportGraph.Assert(new Triple(exportGraph.CreateUriNode(prefixRessource + EncodeIRI(dataprovidervalue + "-" + semanticDefinition.ClassName + "-" + subjectvalue)),
                exportGraph.CreateUriNode(semanticDefinition.AttributePrefixId + ":" + semanticDefinition.AttributeName),
                exportGraph.CreateLiteralNode(objectvalue, new Uri(XmlSpecsHelper.XmlSchemaDataTypeDateTime))));
                break;
             case "uri":
                exportGraph.Assert(new Triple(exportGraph.CreateUriNode(prefixRessource + EncodeIRI(dataprovidervalue + "-" + semanticDefinition.ClassName + "-" + subjectvalue)),
                exportGraph.CreateUriNode(semanticDefinition.AttributePrefixId + ":" + semanticDefinition.AttributeName),
                exportGraph.CreateUriNode(prefixRessource + EncodeIRI(dataprovidervalue + "-" + semanticDefinition.Range + "-" + objectvalue))));
                break;
             default:
                log.Warn("undefined objecttype=" + objecttype, process, runConfig.Project);
                break;
          } // switch
      } // for each datareader
   } // for each predicate
   // all the triples have been added to the graph, now write it to the Turtle file.

   CompressingTurtleWriter turtlewriter = new CompressingTurtleWriter(5, TurtleSyntax.W3C);
   turtlewriter.PrettyPrintMode = true;
   turtlewriter.Save(exportGraph, CreateFileName(SubjectPseudoId));
   
   // dispose of the graph class
   exportGraph.Dispose();

} 
// return control to the calling function to process the next patient
// take the next SubjectPseudoId and call the function again until array is processed.

So far I have tried to call Dispose or Finalize on the CompressingTurtleWriter, but neither method is available, even though https://www.dotnetrdf.org/api/html/T_VDS_RDF_Writing_CompressingTurtleWriter.htm#! suggests that CompressingTurtleWriter has a protected Finalize() method.

I call Dispose() on the Graph before exiting the function.

I tried this with both .NET 5.0 and .NET Core 3.1, but the behaviour is the same. I also tried running the function as a Task, but that did not change the memory usage.

I ran the VS Diagnostic Tools and took a snapshot after exportGraph.Dispose(). After extract 15 it shows:

Object Type                                     Count        Size(Bytes)    InclusiveSize (Bytes)
VDS.Common.Tries.SparseCharacterTrieNode<Uri>   5'823'385    326'109'560    1'899'037'768

and after extract 25:

Object Type                                     Count        Size(Bytes)    InclusiveSize (Bytes)
VDS.Common.Tries.SparseCharacterTrieNode<Uri>   11'882'772   665'435'232    1'540'054'160

After 25 extracts, Task Manager shows the program using 1'646'964 K, versus about 250'000 K at the start of the program.

The total size of the 25 extract files is about 302 MB.

I can't see any issue in my code, and I wonder why there are so many VDS.Common.Tries.SparseCharacterTrieNode<Uri> instances still on the heap.

Has anybody had a similar experience, or does anyone have an idea how to solve this?


Solution

  • I think the problem is that dotNetRDF caches all of the URIs that are created while each graph is built, and that cache is global, so it is not released when an individual graph is disposed. That would explain the SparseCharacterTrieNode<Uri> instances that keep accumulating in your heap snapshots. I would suggest setting VDS.RDF.Options.InternUris to false before you start processing. This is a global setting, so it only needs to be done once at the start of your program.

    You can also reduce memory usage of each individual graph by opting for just simple indexing (set VDS.RDF.Options.FullTripleIndexing to false), or by using the NonIndexedGraph instead of the default Graph implementation (this is assuming all you are doing is generating and then serializing the graphs). There are some tips on reducing memory usage here.
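
As a rough illustration of the settings mentioned above, here is a minimal sketch. It assumes a dotNetRDF 2.x release in which the static VDS.RDF.Options class is available (newer major versions may expose these settings differently). The class name ExportSketch, the ex: namespace, and the file name patient-1.ttl are hypothetical placeholders and not part of the original code.

```csharp
using System;
using VDS.RDF;
using VDS.RDF.Parsing;
using VDS.RDF.Writing;

public static class ExportSketch
{
    public static void Main()
    {
        // Global settings: set once at start-up, before any graph is created.
        Options.InternUris = false;          // do not add created URIs to the global URI cache
        Options.FullTripleIndexing = false;  // keep only simple indexes on indexed graphs

        // NonIndexedGraph maintains no triple indexes at all; assumed to be
        // sufficient when a graph is only built and then serialized.
        using (IGraph g = new NonIndexedGraph())
        {
            // Hypothetical namespace and triple, standing in for the real patient data.
            g.NamespaceMap.AddNamespace("ex", new Uri("http://example.org/"));
            g.Assert(new Triple(
                g.CreateUriNode("ex:patient-1"),
                g.CreateUriNode("ex:label"),
                g.CreateLiteralNode("demo", new Uri(XmlSpecsHelper.XmlSchemaDataTypeString))));

            // Same writer configuration as in the question.
            var writer = new CompressingTurtleWriter(5, TurtleSyntax.W3C);
            writer.PrettyPrintMode = true;
            writer.Save(g, "patient-1.ttl"); // hypothetical output file
        }
    }
}
```

FullTripleIndexing only affects indexed graphs, so it matters if you keep the default Graph, while NonIndexedGraph skips indexing altogether. Without indexes, the writer's lookups may be slower on large graphs, so it is worth measuring both variants on one of the bigger patients.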