
Merge RDF .ttl files into one file database - filtering and keeping only the data/triples needed


I need to merge 1000+ .ttl files into one file database. How can I merge them while filtering the data in the source files, so that only the data I need ends up in the target file?

Thanks


Solution

  • There are a number of options, but the simplest way is probably to use a Turtle parser to read all the files, and let that parser pass its output to a handler that does the filtering before passing the data on to a Turtle writer.

    Something like this would probably work (using RDF4J):

      // note: the Rio writer needs an OutputStream (or Writer), not a File
      RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, new FileOutputStream(outFile));
    
      writer.startRDF();
      for (File file : inputFiles) { // inputFiles: your collection of 1000+ input files
          Model data = Rio.parse(new FileInputStream(file), "", RDFFormat.TURTLE);
          for (Statement st : data) {
              if (keep(st)) { // keep(st): placeholder for whatever condition decides whether you want this statement
                  writer.handleStatement(st);
              }
          }
      }
      writer.endRDF(); 
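
    As an illustration of what that filter condition might look like, here is a minimal sketch of a keep(st) helper, assuming (purely as an example) that you only want statements with one particular predicate; the http://example.org/name IRI and the criterion itself are hypothetical, so substitute whatever you actually need:

      // (uses org.eclipse.rdf4j.model.IRI, Statement and impl.SimpleValueFactory)
      // hypothetical filter: keep only statements whose predicate is http://example.org/name
      static final IRI WANTED_PREDICATE =
          SimpleValueFactory.getInstance().createIRI("http://example.org/name");
    
      static boolean keep(Statement st) {
          return WANTED_PREDICATE.equals(st.getPredicate());
      }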
    

    Alternatively, just load all the files into an RDF Repository and use SPARQL queries to extract the data you want and save it to an output file; or, if you prefer, use SPARQL updates to remove the data you don't want before exporting the entire repository to a file.

    Something along these lines (again using RDF4J):

     Repository rep = ... // your RDF repository, e.g. an in-memory store or native RDF database
    
     try (RepositoryConnection conn = rep.getConnection()) {
    
        // load all files into the database
        for (File file : inputFiles) { // inputFiles: your collection of 1000+ input files
            conn.add(file, "", RDFFormat.TURTLE);
        }
    
        // do a SPARQL update to remove all instances of ex:Foo
        // (the ex: prefix has to be declared in the update; the namespace below is just an example)
        conn.prepareUpdate(
            "PREFIX ex: <http://example.org/> DELETE WHERE { ?s a ex:Foo ; ?p ?o }").execute();
    
        // export the entire repository to the output file (the writer needs an OutputStream)
        conn.export(Rio.createWriter(RDFFormat.TURTLE, new FileOutputStream(outFile)));
     } finally {
        rep.shutDown(); 
     } 
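
    For the Repository rep = ... part: a plain in-memory store is usually fine for a one-off merge like this, and a native (disk-based) store is the usual alternative if the combined data won't fit in memory. A rough sketch, where the data directory path is just an example (and note that, depending on your RDF4J version, you may need to call rep.init() before use):

      // option 1: in-memory store (everything has to fit in RAM)
      Repository rep = new SailRepository(new MemoryStore());
    
      // option 2: native store persisted on disk (choose your own data directory)
      Repository rep = new SailRepository(new NativeStore(new File("/tmp/rdf4j-merge")));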
    

    Depending on the amount of data and the size of your files, you may need to extend this basic setup a bit, for example by using explicit transactions instead of just letting the connection auto-commit. But hopefully this gives you the general idea.
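
    For example, one way to do that (a sketch only; the batch size of 100 files is an arbitrary choice, and inputFiles is again your own collection of files) is to commit in batches:

     try (RepositoryConnection conn = rep.getConnection()) {
        conn.begin();
        int filesInBatch = 0;
        for (File file : inputFiles) {
            conn.add(file, "", RDFFormat.TURTLE);
            if (++filesInBatch >= 100) { // commit every 100 files to keep transactions bounded
                conn.commit();
                conn.begin();
                filesInBatch = 0;
            }
        }
        conn.commit(); // commit the final (partial) batch
     }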