How to delete lines with specific subjects from an RDF file?


I have a file containing RDF triples (subject-predicate-object) in Turtle syntax (.ttl), and another file that contains only a list of subjects.

For example:

<http://dbpedia.org/resource/AlbaniaHistory> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaHistory"@en .
<http://dbpedia.org/resource/AsWeMayThink> <http://www.w3.org/2000/01/rdf-schema#label> "AsWeMayThink"@en .
<http://dbpedia.org/resource/AlbaniaEconomy> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaEconomy"@en .
<http://dbpedia.org/resource/AlbaniaGovernment> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaGovernment"@en .

And in the other file I have, for example:

<http://dbpedia.org/resource/AlbaniaHistory>
<http://dbpedia.org/resource/AlbaniaGovernment>
<http://dbpedia.org/resource/Pérotin>
<http://dbpedia.org/resource/ArtificalLanguages>

I would like to get:

<http://dbpedia.org/resource/AlbaniaHistory> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaHistory"@en .
<http://dbpedia.org/resource/AlbaniaGovernment> <http://www.w3.org/2000/01/rdf-schema#label> "AlbaniaGovernment"@en .

In other words, I would like to remove from the first file every triple whose subject does not appear in the second file. How can I do this?

I tried this in Java by reading the contents of the second file into an ArrayList and using the contains method to check whether the subject of each triple in the first file matches any line in the second file. However, this is too slow because the files are very big.

Thank you very much for helping


Solution

  • In Java, you could use an RDF library to read/write in streaming fashion and do some basic filtering.

    For example, using RDF4J's Rio parser you could create a simple SubjectFilter class that checks, for each triple, whether it has one of the required subjects:

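    import java.util.Set;
    import org.eclipse.rdf4j.model.*;
    import org.eclipse.rdf4j.rio.*;
    import org.eclipse.rdf4j.rio.helpers.RDFHandlerWrapper;

    public class SubjectFilter extends RDFHandlerWrapper {
        // subjects to keep; pass in a HashSet so that each lookup is fast
        private final Set<Resource> myListOfSubjects;

        public SubjectFilter(RDFHandler handler, Set<Resource> subjects) {
            super(handler);
            this.myListOfSubjects = subjects;
        }

        @Override
        public void handleStatement(Statement st) throws RDFHandlerException {
           // only write the statement if it has a subject we want
           if (myListOfSubjects.contains(st.getSubject())) {
              super.handleStatement(st);
           }
        }
    }
    

    And then connect a parser to a writer that writes out the filtered content, something along these lines. Note that the subjects are kept in a HashSet: unlike ArrayList.contains, which scans the whole list for every triple, a hash lookup takes roughly constant time.

    // the subjects to keep (one <IRI> per line; placeholder path), in a HashSet for fast lookups
    ValueFactory vf = SimpleValueFactory.getInstance();
    Set<Resource> subjects = new HashSet<>();
    for (String line : Files.readAllLines(Paths.get("/path/to/subjects.txt"))) {
        String iri = line.trim();
        // strip the surrounding angle brackets before creating the IRI
        if (!iri.isEmpty()) subjects.add(vf.createIRI(iri.substring(1, iri.length() - 1)));
    }

    RDFParser rdfParser = Rio.createParser(RDFFormat.TURTLE);
    RDFWriter rdfWriter = Rio.createWriter(RDFFormat.TURTLE,
                   new FileOutputStream("/path/to/example-output.ttl"));

    // link our parser to our writer, wrapping the writer in our subject filter
    rdfParser.setRDFHandler(new SubjectFilter(rdfWriter, subjects));

    // start processing
    rdfParser.parse(new FileInputStream("/path/to/input-file.ttl"), "");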
    

    For more details on how to use RDF4J and the Rio parsers, see the RDF4J documentation.

    As an aside: although this is perhaps more work than doing some command-line magic with tools like grep and awk, the advantage is that it is semantically robust. You leave the interpretation of which part of your data is the triple's subject to a processor that understands RDF, rather than making an educated guess with a regex ("it's probably the first URL on each line"), which may break when the input file uses a slightly different syntax variation.

    (disclosure: I am on the RDF4J development team)
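
For comparison, the line-based shortcut mentioned in the aside can be sketched in plain Java with no RDF library. This is only a rough sketch under a strong assumption: every triple sits on a single line and starts with its subject written as a full `<IRI>`, as in the example data in the question. It would silently miss triples in Turtle files that use prefixes, `;` continuations, or multi-line statements. The class and method names (`LineSubjectFilter`, `firstToken`, `loadSubjects`, `filter`) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

final class LineSubjectFilter {

    // Returns the first whitespace-delimited token of a line.
    // ASSUMPTION: that token is the triple's subject, written as <IRI>.
    static String firstToken(String line) {
        String trimmed = line.trim();
        int end = 0;
        while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end);
    }

    // Reads one subject per line into a HashSet, so each later lookup
    // takes roughly constant time instead of scanning a list.
    static Set<String> loadSubjects(Path subjectsFile) throws IOException {
        Set<String> subjects = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(subjectsFile)) {
            String line;
            while ((line = in.readLine()) != null) {
                String subject = line.trim();
                if (!subject.isEmpty()) {
                    subjects.add(subject);
                }
            }
        }
        return subjects;
    }

    // Streams the input line by line and copies only the lines whose first
    // token is a wanted subject. Returns the number of lines kept.
    static long filter(Path input, Path output, Set<String> subjects) throws IOException {
        long kept = 0;
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (subjects.contains(firstToken(line))) {
                    out.write(line);
                    out.newLine();
                    kept++;
                }
            }
        }
        return kept;
    }
}
```

Called as `LineSubjectFilter.filter(input, output, LineSubjectFilter.loadSubjects(subjectsFile))` on the example files from the question, this would keep the AlbaniaHistory and AlbaniaGovernment lines. Only the subject set is held in memory; the triples file is streamed, so its size should not matter much.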