Kylo Spark Properties

Overview

The kylo-spark-shell process compiles and executes Scala code for schema detection and data transformations. It is started in the background by kylo-services as needed.

At least two such processes will be running. The first performs schema detection on sample files; the second executes data transformations and may start additional processes when user impersonation is enabled.

Once a process has started, it calls back to kylo-services and registers itself. This allows Spark to run in yarn-cluster mode, because the driver may run on any node in the cluster.

The auth-spark Spring profile must be enabled in kylo-services for the Spark client to start.
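
The profile can be enabled by appending auth-spark to the active profiles in kylo-services. A minimal sketch, assuming the standard /opt/kylo/kylo-services/conf/application.properties layout (the other profile names shown are examples; keep your existing list):

# /opt/kylo/kylo-services/conf/application.properties
# Append auth-spark to the existing profile list; the other entries here are examples.
spring.profiles.include=native,nifi-v1,auth-kylo,auth-file,auth-spark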

Configuration

The default location of the configuration file is /opt/kylo/kylo-services/conf/spark.properties.

Spark Properties

The default property values should work on most systems. An error will be logged if Kylo is unable to determine the correct value from the environment.

spark.shell.appResource
    Type: String
    Path to the kylo-spark-shell-client jar file. This is only needed if Kylo
    is unable to determine the location automatically. The default location for
    Spark 1.x is /opt/kylo/kylo-services/lib/app/kylo-spark-shell-client-v1-*.jar;
    a corresponding v2 jar is provided for Spark 2.x.

spark.shell.deployMode
    Type: String
    Whether to launch a kylo-spark-shell process locally (client) or on one of
    the worker machines inside the cluster (cluster). Set to cluster when
    enabling user impersonation.

spark.shell.files
    Type: String
    Additional files to be submitted with the Spark application. Separate
    multiple files with commas.

spark.shell.javaHome
    Type: String
    The JAVA_HOME for launching the Spark application.

spark.shell.idleTimeout
    Type: Number
    Default: 900
    Amount of time in seconds to wait for a user request before terminating a
    kylo-spark-shell process. Any user request sent to the process resets this
    timeout. Only used in yarn-cluster mode.

spark.shell.jars
    Type: String
    Additional jars to be submitted with the Spark application. Separate
    multiple jars with commas.

spark.shell.master
    Type: String
    Whether to run Spark executors locally (local) or inside a YARN cluster
    (yarn). Set to yarn when enabling user impersonation.

spark.shell.portMin
    Type: Number
    Default: 45000
    Minimum port number that a kylo-spark-shell process may listen on.

spark.shell.portMax
    Type: Number
    Default: 45999
    Maximum port number that a kylo-spark-shell process may listen on.

spark.shell.propertiesFile
    Type: String
    A custom properties file with Spark configuration for the application.

spark.shell.proxyUser
    Type: Boolean
    Default: false
    Set to true to enable Multi-User mode.

spark.shell.registrationKeystorePassword
    Type: String
    Password to the keystore when registrationUrl uses SSL.

spark.shell.registrationKeystorePath
    Type: String
    Path to the keystore when registrationUrl uses SSL.

spark.shell.registrationUrl
    Type: String
    Kylo Services URL for registering the Spark application once it has
    started. Defaults to http://<server-address>:8400/proxy/v1/spark/shell/register.

spark.shell.sparkArgs
    Type: String
    Additional arguments to include in the Spark invocation.

spark.shell.sparkHome
    Type: String
    A custom Spark installation location for the application.

spark.shell.verbose
    Type: Boolean
    Default: false
    Enables verbose reporting for spark-submit.

Example spark.properties configuration for yarn-cluster mode:

spark.shell.deployMode = cluster
spark.shell.master = yarn
spark.shell.files = /opt/kylo/kylo-services/conf/log4j-spark.properties,/opt/kylo/kylo-services/conf/spark.properties
spark.shell.jars = /opt/kylo/kylo-services/lib/mariadb-java-client-1.5.7.jar
spark.shell.sparkArgs = --driver-memory 512m --executor-memory 512m --driver-java-options -Dlog4j.configuration=log4j-spark.properties

Example spark.properties configuration for local mode:

spark.shell.master = local[1]
spark.shell.sparkArgs = --driver-memory 512m --executor-memory 512m --driver-class-path /opt/kylo/kylo-services/conf:/opt/kylo/kylo-services/lib/mariadb-java-client-1.5.7.jar --driver-java-options -Dlog4j.configuration=log4j-spark.properties

If user impersonation (spark.shell.proxyUser) is enabled, then Hadoop must be configured to allow the kylo user to proxy other users:

$ vim /etc/hadoop/conf/core-site.xml

<property>
  <name>hadoop.proxyuser.kylo.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.kylo.hosts</name>
  <value>*</value>
</property>
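
Note that changes to core-site.xml take effect only after they have been distributed to all cluster nodes and the affected Hadoop services have been restarted.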

Kerberos

If user impersonation (spark.shell.proxyUser) is disabled, the Kerberos principal and keytab are passed to Spark, which acquires the Kerberos ticket itself.

If user impersonation is enabled, Kylo periodically executes kinit to ensure there is an active Kerberos ticket. This prevents the impersonated user from gaining access to the keytab file. See Enable Hive User Impersonation for more information on configuring user impersonation in a Kerberized environment.
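
The periodic renewal is equivalent to running kinit against the configured principal and keytab. A sketch of the underlying command, using the example principal and keytab from the configuration below:

# Acquire a ticket for the kylo principal from its keytab (no password prompt)
kinit -kt /etc/security/keytabs/kylo.headless.keytab kylo
# Verify that an active ticket exists
klist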

kerberos.spark.kerberosEnabled
    Type: Boolean
    Default: false
    Indicates that an active Kerberos ticket is needed to start a
    kylo-spark-shell process.

kerberos.spark.kerberosPrincipal
    Type: String
    Name of the principal for acquiring a Kerberos ticket.

kerberos.spark.keytabLocation
    Type: String
    Local path to the keytab for acquiring a Kerberos ticket.

kerberos.spark.initInterval
    Type: Number
    Default: 43200
    Amount of time in seconds to cache a Kerberos ticket before acquiring a
    new one. Only used when user impersonation is enabled. A value of 0
    disables calling kinit.

kerberos.spark.initTimeout
    Type: Number
    Default: 10
    Amount of time in seconds to wait for kinit to acquire a ticket before
    killing the process. Only used when user impersonation is enabled.

kerberos.spark.retryInterval
    Type: Number
    Default: 120
    Amount of time in seconds to wait before retrying to acquire a Kerberos
    ticket if the last attempt failed. Only used when user impersonation is
    enabled.

kerberos.spark.realm
    Type: String
    Name of the Kerberos realm to append to usernames.

Example spark.properties configuration:

spark.shell.deployMode = cluster
spark.shell.master = yarn
spark.shell.proxyUser = true
spark.shell.sparkArgs = --driver-java-options -Djavax.security.auth.useSubjectCredsOnly=false

kerberos.spark.kerberosEnabled = true
kerberos.spark.kerberosPrincipal = kylo
kerberos.spark.keytabLocation = /etc/security/keytabs/kylo.headless.keytab

Logging

Spark application logs are written to the kylo-services.log file by default. This can be customized by adding the following properties to /opt/kylo/kylo-services/conf/log4j.properties:

log4j.additivity.org.apache.spark.launcher.app.SparkShellApp=false
log4j.logger.org.apache.spark.launcher.app.SparkShellApp=INFO, sparkShellLog

log4j.appender.sparkShellLog=org.apache.log4j.DailyRollingFileAppender
log4j.appender.sparkShellLog.File=/var/log/kylo-services/kylo-spark-shell.log
log4j.appender.sparkShellLog.append=true
log4j.appender.sparkShellLog.layout=org.apache.log4j.PatternLayout
log4j.appender.sparkShellLog.Threshold=INFO
log4j.appender.sparkShellLog.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %t:%c{1}:%L - %m%n

Deprecated Properties

The kylo-spark-shell process can be run independently of kylo-services by setting the spark.shell.server.host and spark.shell.server.port properties. In this mode the other spark.shell.* properties are ignored and should instead be passed to spark-submit when starting kylo-spark-shell (see the sketch at the end of this section).

server.port
    Type: Number
    Default: 8450
    Port for kylo-spark-shell to listen on.

spark.shell.server.host
    Type: String
    Host name or address where the kylo-spark-shell process is running as a
    server.

spark.shell.server.port
    Type: Number
    Default: 8450
    Port where the kylo-spark-shell process is listening.

spark.ui.port
    Type: Number
    Default: 8451
    Port for the Spark UI to listen on.

Advanced options are available by using Spring Boot properties.

Example spark.properties configuration:

spark.shell.server.host = localhost
spark.shell.server.port = 8450
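
When running independently, the kylo-spark-shell jar is launched manually with spark-submit. A minimal sketch; the main class name is an assumption and should be verified against the installed jar:

# Sketch only: verify the --class value against the jar shipped with your installation.
spark-submit --master yarn --deploy-mode client \
    --class com.thinkbiganalytics.spark.SparkShellApp \
    /opt/kylo/kylo-services/lib/app/kylo-spark-shell-client-v1-*.jar \
    --server.port=8450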

Wrangler Properties

These properties are used by the Data Transformation feed and the Visual Query page.

spark.shell.datasources.exclude
    Type: String
    A comma-separated list of Spark datasources to exclude when saving a
    Visual Query transformation. May be either the short name or the class
    name.

spark.shell.datasources.include
    Type: String
    A comma-separated list of Spark datasource classes to include when saving
    a Visual Query transformation.

spark.shell.datasources.exclude.downloads
    Type: String
    A comma-separated list used to fine-tune the datasources available for
    download by excluding entries from the master set of sources specified
    with the spark.shell.datasources root properties above. Uses short names
    only.

spark.shell.datasources.include.tables
    Type: String
    A comma-separated list used to fine-tune the datasources available for
    saving to a table by narrowing the master set of sources specified with
    the spark.shell.datasources root properties above.
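
As an illustration, the following hypothetical spark.properties entries hide the Kafka datasource from Visual Query saves and block JDBC downloads. The names are examples only and should be matched to the datasources actually registered in your Spark installation:

# Hypothetical values; substitute the short or class names used in your environment.
spark.shell.datasources.exclude = kafka
spark.shell.datasources.exclude.downloads = jdbc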