Uploading large RDF data to Apache Jena Fuseki server using python - form too large error


I am trying to upload RDF data stored in .ttl files on my computer to an Apache Jena Fuseki server. I run Fuseki as a standalone server, following the Apache Jena Fuseki documentation (https://jena.apache.org/documentation/fuseki2/fuseki-webapp.html#fuseki-web-application) and an online article (https://medium.com/@fadirra/setting-up-jena-fuseki-with-update-in-windows-10-2c8a2802ee8f). The server appears to be running: I can reach it at localhost:3030. My upload code works for smaller files, but for large files the data is not uploaded. The server logs show the following error:

Caused by: java.lang.IllegalStateException: form too large > 20000000
        at org.eclipse.jetty.server.FormFields.checkMaxLength(FormFields.java:318) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.FormFields.parse(FormFields.java:307) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.FormFields.parse(FormFields.java:39) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.io.content.ContentSourceCompletableFuture.parse(ContentSourceCompletableFuture.java:104) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1212) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.handler.ContextRequest$OnContextDemand.run(ContextRequest.java:74) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.util.thread.SerializedInvoker$Link.run(SerializedInvoker.java:191) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.server.internal.HttpConnection$DemandContentCallback.succeeded(HttpConnection.java:679) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99) ~[fuseki-server.jar:5.0.0]
        at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[fuseki-server.jar:5.0.0]

Here is the code I used for uploading the RDF data:

from SPARQLWrapper import SPARQLWrapper, POST

input_location = "C:/......../Added_Triples.ttl"

with open(input_location, 'r') as f:
    content = f.read()

#print(type(content))
rdf_string_no_prefixes = "\n".join(line for line in content.split("\n") if not line.startswith("@prefix"))

update_query = """ 
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX CSRO: <http://www.semanticweb.org/aagr657/ontologies/2023/9/CraneSpaceRepresentationOntology#>
    PREFIX LinkOnt: <http://purl.org/ConstructLinkOnt/LinkOnt#>
    PREFIX bot: <https://w3id.org/bot#>
    PREFIX expr: <https://w3id.org/express#>
    PREFIX geo: <http://www.opengis.net/ont/geosparql#>
    PREFIX geom: <http://rdf.bg/geometry.ttl#>
    PREFIX ifc: <https://standards.buildingsmart.org/IFC/DEV/IFC2X3/TC1/OWL>
    PREFIX inst: <https://www.ugent.be/myAwesomeFirstBIMProject#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX sf: <http://www.opengis.net/ont/sf#>
    PREFIX omg: <https://w3id.org/omg#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX lbd: <https://linkedbuildingdata.org/LBD#>
    PREFIX props: <http://lbd.arch.rwth-aachen.de/props#>
    PREFIX unit: <http://qudt.org/vocab/unit/>
    PREFIX IFC4-PSD: <https://www.linkedbuildingdata.net/IFC4-PSD#>
    PREFIX smls: <https://w3id.org/def/smls-owl#>
    PREFIX fog: <https://w3id.org/fog#>
    PREFIX cc: <http://creativecommons.org/ns#>
    PREFIX dce: <http://purl.org/dc/elements/1.1/>
    PREFIX express: <https://w3id.org/express#>
    PREFIX list: <https://w3id.org/list#>
    PREFIX vann: <http://purl.org/vocab/vann/>
    PREFIX expr: <https://w3id.org/express#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX : <https://standards.buildingsmart.org/IFC/DEV/IFC2x3/TC1/OWL#>

    INSERT DATA {
        %s
    }
    """ % (rdf_string_no_prefixes)

# Point SPARQLWrapper at the dataset's update endpoint and send the query via POST
sparql = SPARQLWrapper("http://localhost:3030/your-dataset/update")
sparql.setMethod(POST)
sparql.setQuery(update_query)

# Execute the SPARQL Update query
sparql.query()

I read a few Stack Overflow questions about similar errors on other servers, which suggested editing the jetty.xml file. However, I cannot find any such file on my computer. As mentioned above, the code works fine for smaller files; the problem arises only with larger ones. For the time being, I split the larger RDF files into smaller chunks and upload them separately. However, the chunking adds considerable time, so I do not want to use it as a solution. Any help on solving this without chunking would be appreciated. Ideally, I want to upload the whole graph file in one go, in the least time possible.

Edit, based on the answer's suggestion to try requests.post:

I also tried requests.post, using the following code:

import requests
file_location = "C:/.........../Added_Triples.ttl"
sparql_endpoint = "http://localhost:3030/construction_dataset_2/update"  # Adjust the URL accordingly

headers = {'Content-Type': 'text/turtle;charset=utf-8'}
data = open(file_location, 'r').read()
response = requests.post(sparql_endpoint, headers=headers, data=data)

The error I am getting is as follows:

Exception has occurred: ConnectionError
('Connection aborted.', ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine

Also, the server logs show the following:

11:40:03 INFO  Fuseki          :: [23] 415 Unsupported Media Type (0 ms)
11:40:03 INFO  Fuseki          :: [24] POST http://localhost:3030/construction_dataset_2/update
11:40:03 INFO  Fuseki          :: [24] 415 Unsupported Media Type (0 ms)
11:42:17 INFO  Fuseki          :: [25] POST http://localhost:3030/construction_dataset_2/update

Solution

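  • Instead of sending the data as a form-encoded INSERT DATA update (here, via SPARQLWrapper), POST the file itself, with the Content-Type header set appropriately. Send it to the dataset URL, not to the /update endpoint: /update accepts only SPARQL Update requests, which is why posting Turtle there returns 415 Unsupported Media Type.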

    Or use an external process:

    curl -XPOST -T DATA.ttl --header "Content-type: text/turtle" http://localhost:3030/ds
    

    Or load the database (TDB2) before starting the server. This way you can use the TDB2 bulk loaders.
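
For the first suggestion (POSTing the file directly), here is a minimal Python sketch of roughly what the curl command does. It uses only the standard library and streams the file from disk, so the whole graph is never held in memory as a string. The function name upload_turtle, the timeout value and the endpoint passed in are assumptions for illustration, not part of Fuseki or of the original code.

```python
import os
import urllib.request


def upload_turtle(file_path, endpoint, timeout=600):
    """POST a Turtle file to a Graph Store Protocol endpoint; return the HTTP status.

    upload_turtle is a hypothetical helper name; endpoint is whatever URL
    your server exposes for loading data (an assumption to adapt).
    """
    headers = {
        "Content-Type": "text/turtle; charset=utf-8",
        # An explicit length keeps urllib from switching to chunked transfer
        # encoding when the request body is a file object.
        "Content-Length": str(os.path.getsize(file_path)),
    }
    # Binary mode: the bytes are sent as they are on disk, read block by block.
    with open(file_path, "rb") as body:
        request = urllib.request.Request(
            endpoint, data=body, headers=headers, method="POST"
        )
        # urlopen raises urllib.error.HTTPError for 4xx/5xx responses such as 415.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
```

You would call it as upload_turtle("Added_Triples.ttl", "http://localhost:3030/ds"), with the dataset name replaced by your own. If you prefer requests, passing the open file object as data= to requests.post streams the body in the same way.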