apache-spark amazon-emr

About Livy session for Jupyterhub on AWS EMR Spark


My customer has an AD connector configured on JupyterHub installed on AWS EMR, so that different users are authenticated on JupyterHub via AD. My current understanding is that when different users submit their Spark jobs from Jupyter notebooks on JupyterHub to the shared underlying EMR Spark engine, the jobs are submitted to Spark via Livy. Each Livy session has a related Spark session mapped to it (that is my current understanding; correct me if I am wrong).

The question is whether different JupyterHub users share the same Livy session (and hence the same Spark session) or each get a different Livy session (and hence a different Spark session).

The only limited material I can find is:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html

see this arch pic here

Thanks very much in advance!


Solution

  • As far as I know (tested on an HDP distribution), by default the Livy server creates a separate Spark driver, and therefore a separate session, for each user. The server is reached through a Kerberized HTTP interface, so each user has to present a valid ticket, and the corresponding session runs under that user's name. This seems to be the way to go, since each user then has access to their own resources (data, YARN queue and so on). In this setup the Livy server impersonates the user: it runs the Spark job as if it were the user (see Granting Livy the Ability to Impersonate).

    Checking the EMR documentation, I see that the Livy server on EMR can be configured in exactly the same way.

    By default, YARN jobs submitted this way run as user livy, regardless of the user who initiated the job. By setting up user impersonation you can have the user ID of the notebook user also be the user associated with the YARN job. Rather than having jobs initiated by both shirley and diego associated with the user livy, jobs that each user initiates are associated with shirley and diego respectively. This helps you to audit Jupyter usage and manage applications within your organization.

    So you can choose whether to use impersonation (jobs run as distinct users) or not (all jobs run as the single livy user).
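
To make the two options concrete, here is a minimal sketch in Python. It assumes the property names given in the EMR JupyterHub documentation for impersonation (`livy.impersonation.enabled` plus the `hadoop.proxyuser.livy.*` settings) and the `proxyUser` field of the Livy REST API's `POST /sessions` body. The helper names `session_request` and `sessions_by_user` are hypothetical, as is the sample listing, and the wildcard proxy-user values are permissive placeholders rather than a recommendation.

```python
import json
from collections import defaultdict

# EMR configuration classifications that turn on Livy impersonation.
# Property names follow the EMR JupyterHub documentation; the "*" values
# are an assumption and should be narrowed for a real cluster.
IMPERSONATION_CONFIG = [
    {
        "Classification": "core-site",
        "Properties": {
            "hadoop.proxyuser.livy.groups": "*",
            "hadoop.proxyuser.livy.hosts": "*",
        },
    },
    {
        "Classification": "livy-conf",
        "Properties": {"livy.impersonation.enabled": "true"},
    },
]


def session_request(user, kind="pyspark"):
    """Body for POST /sessions on the Livy REST API, run on behalf of `user`."""
    return {"kind": kind, "proxyUser": user}


def sessions_by_user(listing):
    """Group session ids from a GET /sessions style response by proxyUser."""
    grouped = defaultdict(list)
    for session in listing.get("sessions", []):
        grouped[session.get("proxyUser")].append(session["id"])
    return dict(grouped)


if __name__ == "__main__":
    # JSON that could be supplied as cluster configuration at creation time.
    print(json.dumps(IMPERSONATION_CONFIG, indent=2))

    # Hypothetical listing, shaped like a Livy GET /sessions response.
    sample = {
        "sessions": [
            {"id": 0, "proxyUser": "shirley", "kind": "pyspark"},
            {"id": 1, "proxyUser": "diego", "kind": "pyspark"},
        ]
    }
    print(sessions_by_user(sample))
```

Grouping the output of `GET /sessions` this way is one quick check of the behaviour described above: with impersonation enabled, each notebook user should show up with their own session id(s) rather than everything being attributed to livy.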