Kylo Spark Properties¶
Overview¶
The kylo-spark-shell process compiles and executes Scala code for schema detection and data transformations. It is started in the background by kylo-services as needed.
There will be at least two processes: the first performs schema detection on sample files; the second executes data transformations and may start additional processes if user impersonation is enabled.
Once a process has started, it calls back to kylo-services and registers itself. This allows Spark to run in yarn-cluster mode, since the driver can run on any node in the cluster.
The auth-spark Spring profile must be enabled in kylo-services for the Spark client to start.
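For example, assuming your installation uses the standard Spring Boot spring.profiles.include list in /opt/kylo/kylo-services/conf/application.properties, append auth-spark to the profiles already listed there:
spring.profiles.include = <existing profiles>,auth-spark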
Configuration¶
The default location of the configuration file is /opt/kylo/kylo-services/conf/spark.properties.
Spark Properties¶
The default property values should work on most systems. An error will be logged if Kylo is unable to determine the correct value from the environment.
Property | Type | Default | Description |
---|---|---|---|
spark.shell.appResource | String | | Path to the kylo-spark-shell-client jar file. This is only needed if Kylo is unable to determine the location automatically. The default location for Spark 1.x is /opt/kylo/kylo-services/lib/app/kylo-spark-shell-client-v1-*.jar. There is a corresponding v2 jar for Spark 2.x. |
spark.shell.deployMode | String | | Whether to launch a kylo-spark-shell process locally (client) or on one of the worker machines inside the cluster (cluster). Set to cluster when enabling user impersonation. |
spark.shell.files | String | | Additional files to be submitted with the Spark application. Separate multiple files with a comma. |
spark.shell.javaHome | String | | The JAVA_HOME for launching the Spark application. |
spark.shell.idleTimeout | Number | 900 | Amount of time in seconds to wait for a user request before terminating a kylo-spark-shell process. Any user request sent to the process resets this timeout. Only used in yarn-cluster mode. |
spark.shell.jars | String | | Additional jars to be submitted with the Spark application. Separate multiple jars with a comma. |
spark.shell.master | String | | Whether to run Spark executors locally (local) or inside a YARN cluster (yarn). Set to yarn when enabling user impersonation. |
spark.shell.portMin | Number | 45000 | Minimum port number that a kylo-spark-shell process may listen on. |
spark.shell.portMax | Number | 45999 | Maximum port number that a kylo-spark-shell process may listen on. |
spark.shell.propertiesFile | String | | A custom properties file with Spark configuration for the application. |
spark.shell.proxyUser | Boolean | false | Set to true to enable Multi-User mode. |
spark.shell.registrationKeystorePassword | String | | Password to the keystore when registrationUrl uses SSL. |
spark.shell.registrationKeystorePath | String | | Path to the keystore when registrationUrl uses SSL. |
spark.shell.registrationUrl | String | | Kylo Services URL for registering the Spark application once it has started. Defaults to http://<server-address>:8400/proxy/v1/spark/shell/register. |
spark.shell.sparkArgs | String | | Additional arguments to include in the Spark invocation. |
spark.shell.sparkHome | String | | A custom Spark installation location for the application. |
spark.shell.verbose | Boolean | false | Enables verbose reporting for Spark Submit. |
Example spark.properties configuration for yarn-cluster mode:
spark.shell.deployMode = cluster
spark.shell.master = yarn
spark.shell.files = /opt/kylo/kylo-services/conf/log4j-spark.properties,/opt/kylo/kylo-services/conf/spark.properties
spark.shell.jars = /opt/kylo/kylo-services/lib/mariadb-java-client-1.5.7.jar
spark.shell.sparkArgs = --driver-memory 512m --executor-memory 512m --driver-java-options -Dlog4j.configuration=log4j-spark.properties
Example spark.properties configuration for local mode:
spark.shell.master = local[1]
spark.shell.sparkArgs = --driver-memory 512m --executor-memory 512m --driver-class-path /opt/kylo/kylo-services/conf:/opt/kylo/kylo-services/lib/mariadb-java-client-1.5.7.jar --driver-java-options -Dlog4j.configuration=log4j-spark.properties
If user impersonation (spark.shell.proxyUser) is enabled, then Hadoop must be configured to allow the kylo user to proxy users:
$ vim /etc/hadoop/conf/core-site.xml
<property>
<name>hadoop.proxyuser.kylo.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.kylo.hosts</name>
<value>*</value>
</property>
Kerberos¶
If user impersonation (spark.shell.proxyUser) is disabled, then the Kerberos principal and keytab are passed to Spark, which acquires the Kerberos ticket.
If user impersonation is enabled then Kylo will periodically execute kinit to ensure there is an active Kerberos ticket. This prevents the impersonated user from having access to the keytab file. See Enable Hive User Impersonation for more information on configuring user impersonation in a Kerberized environment.
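The refresh Kylo performs is roughly equivalent to running kinit by hand with the configured keytab and principal, for example (using the keytab path and principal from the example configuration below):
$ kinit -kt /etc/security/keytabs/kylo.headless.keytab kylo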
Property | Type | Default | Description |
---|---|---|---|
kerberos.spark.kerberosEnabled | Boolean | false | Indicates that an active Kerberos ticket is needed to start a kylo-spark-shell process. |
kerberos.spark.kerberosPrincipal | String | | Name of the principal for acquiring a Kerberos ticket. |
kerberos.spark.keytabLocation | String | | Local path to the keytab for acquiring a Kerberos ticket. |
kerberos.spark.initInterval | Number | 43200 | Amount of time in seconds to cache a Kerberos ticket before acquiring a new one. Only used when user impersonation is enabled. A value of 0 disables calling kinit. |
kerberos.spark.initTimeout | Number | 10 | Amount of time in seconds to wait for kinit to acquire a ticket before killing the process. Only used when user impersonation is enabled. |
kerberos.spark.retryInterval | Number | 120 | Amount of time in seconds to wait before retrying to acquire a Kerberos ticket if the last attempt failed. Only used when user impersonation is enabled. |
kerberos.spark.realm | String | | Name of the Kerberos realm to append to usernames. |
Example spark.properties configuration:
spark.shell.deployMode = cluster
spark.shell.master = yarn
spark.shell.proxyUser = true
spark.shell.sparkArgs = --driver-java-options -Djavax.security.auth.useSubjectCredsOnly=false
kerberos.spark.kerberosEnabled = true
kerberos.spark.kerberosPrincipal = kylo
kerberos.spark.keytabLocation = /etc/security/keytabs/kylo.headless.keytab
Logging¶
Spark application logs are written to the kylo-services.log file by default. This can be customized with the following properties added to /opt/kylo/kylo-services/conf/log4j.properties:
log4j.additivity.org.apache.spark.launcher.app.SparkShellApp=false
log4j.logger.org.apache.spark.launcher.app.SparkShellApp=INFO, sparkShellLog
log4j.appender.sparkShellLog=org.apache.log4j.DailyRollingFileAppender
log4j.appender.sparkShellLog.File=/var/log/kylo-services/kylo-spark-shell.log
log4j.appender.sparkShellLog.append=true
log4j.appender.sparkShellLog.layout=org.apache.log4j.PatternLayout
log4j.appender.sparkShellLog.Threshold=INFO
log4j.appender.sparkShellLog.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %t:%c{1}:%L - %m%n
Deprecated Properties¶
The kylo-spark-shell process can be run independently of kylo-services by setting the spark.shell.server.host and spark.shell.server.port properties. In this mode, the other spark.shell.* properties are ignored and should instead be passed to spark-submit when starting kylo-spark-shell.
Property | Type | Default | Description |
---|---|---|---|
server.port | Number | 8450 | Port for kylo-spark-shell to listen on. |
spark.shell.server.host | String | | Host name or address where the kylo-spark-shell process is running as a server. |
spark.shell.server.port | Number | 8450 | Port where the kylo-spark-shell process is listening. |
spark.ui.port | Number | 8451 | Port for the Spark UI to listen on. |
Advanced options are available by using Spring Boot properties.
Example spark.properties configuration:
spark.shell.server.host = localhost
spark.shell.server.port = 8450
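Starting the process manually in this mode is then a plain spark-submit invocation. The following is a minimal sketch: the main class name (com.thinkbiganalytics.spark.SparkShellApp), the jar path, and passing server.port as a Spring Boot command-line argument are assumptions that may differ between Kylo and Spark versions:
$ spark-submit --master yarn --deploy-mode client \
    --class com.thinkbiganalytics.spark.SparkShellApp \
    /opt/kylo/kylo-services/lib/app/kylo-spark-shell-client-v1-*.jar \
    --server.port=8450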
Wrangler Properties¶
These properties are used by the Data Transformation feed and the Visual Query page.
Property | Type | Description |
---|---|---|
spark.shell.datasources.exclude | String | A comma-separated list of Spark data sources to exclude when saving a Visual Query transformation. Entries may be either the short name or the class name. |
spark.shell.datasources.include | String | A comma-separated list of Spark data source classes to include when saving a Visual Query transformation. |
spark.shell.datasources.exclude.downloads | String | A comma-separated list used to fine-tune the data sources available for download, by excluding from the master set of sources specified with the spark.shell.datasources root properties above. Uses short names only. |
spark.shell.datasources.include.tables | String | A comma-separated list used to fine-tune the data sources available for saving to a table, by excluding from the master set of sources specified with the spark.shell.datasources root properties above. |
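As a sketch, excluding one data source by class name when saving and removing one short-named source from downloads might look like this in spark.properties (the data source names here are illustrative assumptions, not a recommended set):
spark.shell.datasources.exclude = org.apache.spark.sql.cassandra
spark.shell.datasources.exclude.downloads = jdbc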