Spark User Impersonation Configuration

Overview

By default, users in Kylo have access to all Hive tables accessible to the kylo user. By configuring Kylo for a secure Hadoop cluster and enabling user impersonation, users will only have access to the Hive tables accessible to their own accounts. A local spark shell process is still used for schema detection when uploading a sample file.

Requirements

This guide assumes that Kylo has already been set up with Kerberos authentication and that each user has an account in the Hadoop cluster.

Kylo Configuration

Kylo will need to launch a separate spark shell process for each user that is actively performing data transformations. This means that the kylo-spark-shell service should no longer be managed by the system.

  1. Stop and disable the system process.
$ service kylo-spark-shell stop
$ chkconfig kylo-spark-shell off
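The commands above assume a SysV-style init system; on systemd-based distributions, the equivalent would typically be:
$ systemctl stop kylo-spark-shell
$ systemctl disable kylo-spark-shell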
  2. Add the auth-spark profile in application.properties. This will enable Kylo to create temporary credentials for the spark shell processes to communicate with the kylo-services process.
$ vim /opt/kylo/kylo-services/conf/application.properties

spring.profiles.include = auth-spark, ...
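The auth-spark profile should be appended to the profiles already listed in the file rather than replacing them. For example (the other profile names here are illustrative only):
spring.profiles.include = auth-kylo, auth-file, auth-spark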
  3. Enable user impersonation in spark.properties. It is recommended that the Spark shell be run in YARN cluster mode (master yarn with deploy mode cluster, as shown below) so that both the Spark driver and the executors run under the user's account. Using the local master or YARN client deploy mode is possible but not recommended, because the Spark driver would then run as the kylo user.
$ vim /opt/kylo/kylo-services/conf/spark.properties

# Ensure these two properties are commented out
#spark.shell.server.host
#spark.shell.server.port

# Executes both driver and executors as the user
spark.shell.deployMode = cluster
spark.shell.master = yarn
# Enables user impersonation
spark.shell.proxyUser = true
# Reduces memory requirements and allows Kerberos user impersonation
spark.shell.sparkArgs = --driver-memory 512m --executor-memory 512m --driver-java-options -Djavax.security.auth.useSubjectCredsOnly=false

kerberos.spark.kerberosEnabled = true
kerberos.spark.kerberosPrincipal = kylo
kerberos.spark.keytabLocation = /etc/security/keytabs/kylo.headless.keytab
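Once a user runs a data transformation, impersonation can be verified from YARN: the spark shell application should be owned by the end user rather than kylo. A quick check, assuming the yarn CLI is available on the node:
$ yarn application -list
# The "User" column for the running Spark shell application should show
# the end user's account, not kylo.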
  4. Redirect logs to kylo-spark-shell.log. By default the logs are written to kylo-services.log and include the output of every spark shell process. The configuration below redirects this output to the kylo-spark-shell.log file instead.
$ vim /opt/kylo/kylo-services/conf/log4j.properties

log4j.additivity.org.apache.spark.launcher.app.SparkShellApp=false
log4j.logger.org.apache.spark.launcher.app.SparkShellApp=INFO, sparkShellLog

log4j.appender.sparkShellLog=org.apache.log4j.DailyRollingFileAppender
log4j.appender.sparkShellLog.File=/var/log/kylo-services/kylo-spark-shell.log
log4j.appender.sparkShellLog.append=true
log4j.appender.sparkShellLog.layout=org.apache.log4j.PatternLayout
log4j.appender.sparkShellLog.Threshold=INFO
log4j.appender.sparkShellLog.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %t:%c{1}:%L - %m%n
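After restarting kylo-services, the output of new spark shell processes should appear in the new log file rather than in kylo-services.log. A quick way to confirm:
$ tail -f /var/log/kylo-services/kylo-spark-shell.log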
  5. Configure Hadoop to allow Kylo to proxy users.
$ vim /etc/hadoop/conf/core-site.xml

<property>
  <name>hadoop.proxyuser.kylo.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.kylo.hosts</name>
  <value>*</value>
</property>
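The wildcard values above allow the kylo user to impersonate any user from any host; they can be narrowed to specific groups and to the host running Kylo if tighter control is desired. After editing core-site.xml, the proxy user settings must be reloaded by the cluster, either by restarting the affected services or, where supported, by refreshing the configuration in place (the commands below assume the HDFS NameNode and YARN ResourceManager consume this file):
$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration
$ yarn rmadmin -refreshSuperUserGroupsConfiguration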