Kylo is a full-featured Data Lake platform built on Apache Hadoop and Spark. Kylo provides a turn-key, business-friendly Data Lake solution enabling data ingest, data preparation, and data discovery.


License: Apache 2.0

Major Features

Data Ingest: Users can easily configure feeds in a guided UI
Data Preparation: Visual SQL builder and data wrangling
Operations Dashboard: Feed health and service monitoring
Global Search: Lucene search against data and metadata

Data Processing

Data Ingest: Guided UI for data ingest into Hive (extensible)
Data Export: Export data to RDBMS or other targets
Data Wrangling: Visually wrangle data and build/schedule recipes
PySpark, Spark Jobs: Execute Spark jobs
Custom Pipelines: Build and templatize new pipelines
Feed Chaining: Trigger feeds based on dependencies and rules
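
To make the idea of feed chaining concrete, here is a minimal sketch of dependency-based triggering. It is not Kylo's implementation or API; the feed names and the `ready_feeds` helper are hypothetical and only illustrate how a feed becomes eligible once all of its upstream feeds have completed.

```python
def ready_feeds(dependencies, completed):
    """Return feeds that have not run yet and whose upstream feeds are all complete.

    dependencies: dict mapping a feed name to the list of feeds it depends on.
    completed: set of feed names that have already finished successfully.
    """
    return sorted(
        feed
        for feed, upstream in dependencies.items()
        if feed not in completed and all(u in completed for u in upstream)
    )


# Hypothetical feeds: the enriched feed waits for both raw feeds.
deps = {
    "customers_raw": [],
    "orders_raw": [],
    "orders_enriched": ["customers_raw", "orders_raw"],
}

after_first = ready_feeds(deps, {"customers_raw"})
after_both = ready_feeds(deps, {"customers_raw", "orders_raw"})
```

In this sketch `after_first` contains only the remaining raw feed, and `after_both` contains the enriched feed. A real precondition could also involve rules such as time windows or data thresholds.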

Ingest Features

Batch: Batch processing
Streaming: Streaming processing
Snapshot/Incremental Loads: Track the high-water mark using a date field, or replace the target
Schema Discovery: Infer schema from source file samples
Data Validation: Configure field validation in the UI
Data Profile: Automatically profile statistics
Data Cleanse/Standardization: Easily configure field standardization rules
Custom Partitioning: Configure Hive partitioning
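
The difference between snapshot and incremental loads can be sketched in a few lines. This is an illustration of the general high-water-mark technique, not Kylo's code; the `modified` field name and both function names are assumptions made for the example.

```python
from datetime import datetime


def incremental_load(source_rows, high_water, date_field="modified"):
    """Select only rows newer than the stored high-water mark.

    Returns the new rows and the updated mark (unchanged if nothing is new).
    """
    new_rows = [row for row in source_rows if row[date_field] > high_water]
    new_mark = max((row[date_field] for row in new_rows), default=high_water)
    return new_rows, new_mark


def snapshot_load(source_rows):
    """Snapshot mode: take every source row and replace the target wholesale."""
    return list(source_rows)


# Hypothetical source table with a last-modified date column.
rows = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 5)},
    {"id": 3, "modified": datetime(2024, 1, 9)},
]

batch, mark = incremental_load(rows, datetime(2024, 1, 3))
next_batch, next_mark = incremental_load(rows, mark)
```

The first call picks up rows 2 and 3 and advances the mark to the latest date seen; the second call, run against the same source, finds nothing new and leaves the mark where it was.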

Ingest Sources

FTP, SFTP: Source from FTP, SFTP
Filesystem: Poll files from a filesystem
HDFS, S3: Extract files from HDFS and S3
RDBMS: Efficiently extract RDBMS data
JMS, Kafka: Source events from queues
REST, HTTP: Source data from messages

Ingest Targets

HDFS: Store data in HDFS
Hive: Store data in Hive tables
HBase: Store data in HBase

Ingest Formats

ORC, Parquet, Avro, RCFile, Text: Store data in popular table formats
Format Compression: Specify compression for ORC and Parquet types
Extensible Source Formats: Define custom schema plug-in SerDes


Metadata

Tag/Glossary: Add tags to feeds for searchability
Business Metadata (extended properties): Add business-defined fields to feeds
REST API: Powerful REST APIs for automation and integration
Visual Lineage: Explore process lineage
Profile History: View history of profile statistics
Search/Discover: Lucene syntax search against data and metadata
Operational Metadata: Extensive metadata capture


Security

Kerberos Support: Supports Kerberized clusters
Obfuscation: Configure field-level data protection
Encryption at Rest: Compatible with HDFS encryption features
Access Control (LDAP, KDC, AD, SSO): Flexible security options
Data Protection: UI-configurable data protection policies
Application Groups, Roles: Admin-configured roles


Operations

Dashboard: KPIs, alerts, performance, troubleshooting
Scheduler: Timer or cron-style scheduling based on the Quartz engine
SLA Monitoring: Service level agreements tied to feed performance
Alerts: Alerts with enterprise integration options
Health Monitoring: Quickly identify feed and service health issues
Performance Reporting: Pivot on performance statistics
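
Because the scheduler is based on Quartz, cron-style schedules follow the Quartz field layout, which starts with a seconds field and allows an optional trailing year. The sketch below labels the fields of such an expression; the `describe_quartz_cron` helper is hypothetical and does not validate individual field values.

```python
# Quartz cron fields, in order; the final "year" field is optional.
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def describe_quartz_cron(expression):
    """Map each part of a Quartz cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("Quartz cron expressions have 6 or 7 fields")
    return dict(zip(QUARTZ_FIELDS, parts))


# "0 0/15 * * * ?" fires every 15 minutes, at second 0.
every_15_minutes = describe_quartz_cron("0 0/15 * * * ?")
```

Note that a five-field Unix cron string is rejected by this helper, which reflects a common mistake when moving schedules from crontab to a Quartz-based scheduler.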


Scalability

Edge Clustering: Scale edge resources