
How to use JSONiq in a Python program?


How can I use JSONiq to query JSON objects in a Python program running on my PC?

I have managed to use JSONiq in a Jupyter Notebook running inside Visual Studio Code, but I'm interested in using JSONiq in a regular Python program (*.py), rather than a notebook (*.ipynb).


Solution

  • It is possible to communicate between Python (outside of a notebook) and a running RumbleDB instance (a JSONiq engine) via HTTP. You can do this by copying the same code that powers the %jsoniq magic in Jupyter notebooks into your own Python script. It is all open source and available here:

    https://github.com/RumbleDB/pyrumble/blob/main/src/rumbledb/rumbledb.py

    You can, of course, adapt the signature of the function to your liking and needs.
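    As a rough sketch of what such an adapted function could look like (the endpoint path /jsoniq and the response fields "values" and "error-message" are assumptions here; the linked rumbledb.py shows the exact protocol the %jsoniq magic uses):

    ```python
    import json
    import urllib.request

    # Assumed endpoint shape; verify against the linked rumbledb.py.
    RUMBLEDB_URL = "http://localhost:8001/jsoniq"

    def build_request(query: str, url: str = RUMBLEDB_URL) -> urllib.request.Request:
        """Build the HTTP POST carrying the JSONiq query in the request body."""
        return urllib.request.Request(url, data=query.encode("utf-8"), method="POST")

    def run_jsoniq(query: str, url: str = RUMBLEDB_URL) -> list:
        """Send a JSONiq query to a running RumbleDB server and return its results."""
        with urllib.request.urlopen(build_request(query, url)) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        # Assumption: the server reports errors inside the response body.
        if "error-message" in body:
            raise RuntimeError(body["error-message"])
        return body.get("values", [])
    ```

    With a server running on port 8001, a call like `run_jsoniq("1 + 1")` would then return the query results as a Python list.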

    Running the server (same as for notebooks)

    As with Jupyter notebooks, the RumbleDB server can run anywhere (locally on your own laptop, or on a cloud provider such as Amazon EMR, etc.) as long as the firewall settings make the port accessible. The command to run a server on, say, port 8001 is:

    spark-submit path-to-rumbledb-1.21.jar serve -p 8001
    

    It is recommended to make sure that only you have access to the port (e.g., by running locally behind a firewall, or by setting up IP restrictions on the cloud provider), because this is a relatively simple implementation with no security built in.

    Passing small amounts of data

    Small amounts of data can be passed directly from Python to JSONiq through the query itself, e.g. by embedding JSON into a global or let variable declaration in the JSONiq query string sent to RumbleDB.

    Small amounts of data can be passed from JSONiq to Python in the query results, which are sent via HTTP back to Python in the response body.
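    A minimal sketch of the first direction, embedding Python data into the query text (the variable name $people and the sample data are arbitrary; JSON literals are valid JSONiq, so json.dumps output can be spliced in directly):

    ```python
    import json

    # Hypothetical sample data; any JSON-serializable Python structure works.
    records = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

    # Splice the data into the query via a global variable declaration.
    query = (
        "declare variable $people := " + json.dumps(records) + "; "
        "for $p in $people[] where $p.age gt 40 return $p.name"
    )
    ```

    The resulting query string can then be posted to the RumbleDB server like any other query.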

    Passing large amounts of data

    Larger amounts of data (CSV, JSON Lines, text, etc.) can be passed from Python to JSONiq by saving them on any file system both have access to (local, S3, HDFS, etc.), and then reading them with JSONiq with the appropriate URI.
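    For example, a Python program could write JSON Lines to a shared location and point a JSONiq query at it (json-lines() is assumed here to be the RumbleDB builtin for reading JSON Lines; the local temporary file stands in for whatever file system both sides can reach):

    ```python
    import json
    import os
    import tempfile

    # Hypothetical sample data, written out as JSON Lines.
    rows = [{"id": i, "value": i * i} for i in range(3)]
    path = os.path.join(tempfile.gettempdir(), "rows.jsonl")
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # JSONiq then reads the file by URI; on a cluster this could
    # just as well be an s3:// or hdfs:// path.
    query = f'for $r in json-lines("file://{path}") where $r.value ge 4 return $r.id'
    ```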

    Large amounts of data can be passed back from JSONiq to Python if the cloud provider supports the Apache Livy standard, in which case you can copy-paste this code instead. This runs the jar as a single job per request, and you can then specify the output path that RumbleDB writes to. This scales to TBs or even PBs of data when running on a cluster and writing back to cloud storage.

    We do have future plans to attempt a tighter integration of JSONiq into Python (which would "embed" RumbleDB in Python for a more natural interaction via memory rather than via a file system), but have no clear dates at this point.