Accelerator for Apache Spark
Analyze your Big Data FAST with the use of this accelerator. Gain insights into your historical data and act in real time on the current streams of data in conjunction with historical analysis to make crucial decisions when it matters.
TIBCO Spotfire®
TIBCO StreamBase®
TIBCO Spotfire Statistics Services
TIBCO Live Datamart
TIBCO LiveView Web
TIBCO Component Exchange License
The Accelerator for Apache Spark enables TIBCO technologies, which interoperate seamlessly with Big Data systems, to combine streams of live data, pattern recognition, and historical queries on extremely large volumes of data. Patterns in historical data are analyzed interactively with Spotfire and TERR, and statistical or machine learning models are trained on big data clusters using Spark, H2O, or arbitrary R packages. The models derived from the historical analysis are then applied to live streaming data for predictive analysis and operational insight, giving the user the opportunity to act in real time, when it matters most.
The DevZone Forums are a traditional threaded discussion service subscribed to by Accelerator Developers, Practitioners, and Customers with a shared interest in the TIBCO Event Processing and Streaming Analytics platforms. Accelerators are provided as fast start templates and design pattern examples and are supported as delivered. For all questions concerning Accelerator use and implementation, please open a new discussion in the DevZone forum here: http://devzone.tibco.com/forums/forums/list.page
Published: May 2016
Initial release May 16, 2016
June 23, 2016 v1.0.1 Spotfire Dashboard updates
The Spark Accelerator is an important newcomer to the Accelerator family. It includes 40 new connectivity points that connect the TIBCO platform to Spark for machine learning, model monitoring, retraining, streaming analytics, and automated action. We see this pattern of use in retail for customer analytics, fraud detection, algorithmic trading, and more. A great example is the presentation on the Spark Accelerator that we shared recently, which also featured customer Scotia Bank, who uses Hadoop in this configuration to discover new algorithms. As the use of Spotfire and Streaming Analytics for managing IoT data increases, we believe the pattern shown by this Accelerator will become increasingly important. Another benefit of this Accelerator project is that we created built-in components ("operators") in StreamBase for SparkML and H2O. Check it out and tell us what you think!
Accelerator for Apache Spark
"Big Data" has gained a lot of momentum. Vast amounts of operational data are collected and stored in Hadoop and other platforms on which historical analysis will be conducted. Business Intelligence tools and distributed statistical computing are used to find new patterns in this data and gain new insights and knowledge, that can then be leveraged for promotions, up- and cross-sell campaigns, improved customer experience or fraud detection.
One of the key challenges in such environments is to quickly turn these newfound insights and patterns into action while processing operational business data in real time. This is necessary to make customers happy, increase revenue, optimize margin, or prevent fraud when it matters most. "Fast Data" provides a stream processing approach that automates decisions and initiates actions in real time, based on the statistical insights obtained from Big Data platforms.
As Big Data solutions grow, so does the need for solutions that focus on scalability. Traditional Big Data is still data-oriented: it assumes the data already exists and needs to be processed. This approach is called Data at Rest.
Event processing, both Complex Event Processing and Event Stream Processing, is inherently tied to message passing, which is called Data in Motion.
Quickly passing small amounts of data raises challenges significantly different from the typical problems solved by massive data processing platforms. Popular Big Data solutions like Hadoop are optimized to defer data movement to the latest possible time and to execute most of the logic where the data is stored. This significantly increases data processing throughput, but at the same time reduces data mobility.
Analysing and visualising the collected data brings vital information to decision makers. However, the process is typically based on historical data, so decisions may be made on stale data. In addition, turning the results of past data processing into action requires significant human effort, such as implementing and deploying decision rules. This adds further delay and increases operational cost.
Contemporary enterprises expect to be able to:
- Act on the fire hose of incoming events
- Execute analytics on the data as soon as it arrives
- Deploy predictive models with minimal operational latency
TIBCO is a vendor of real-time integration solutions and provides proven solutions for event processing. Big Data solutions pose special challenges for event processing:
- Horizontal scalability for moving data
- Analytics-friendly storage of the data
- Native support for execution of analytics models
The Big Data accelerator uses a large retailer's transaction monitoring and propensity prediction scenario to show how these challenges can be addressed with a mix of TIBCO and open source products.
Conceptually, the Apache Spark Accelerator does four things:
- Capture - this includes several distinct and separate activities: connect to data sources to bring data into the system, then cleanse, normalize, and persist the data
- Analytics - this includes data discovery and model discovery. In this step, the historical data is analyzed to build predictive models.
- Streaming Analytics - here we execute the predictive model on the real-time data streams
- Model Tracking - this includes tracking the real-time KPIs
Benefits and Business Value
The key is speed. Unlike most popular solutions, TIBCO products offer real-time integration and reporting.
The Big Data accelerator shows an easy way to combine Fast Data with Big Data. With the proper approach, the delay between data collection and analytics availability can be significantly reduced, leading to faster trend detection.
At the same time, the combination of event processing products and a live data mart computes vital performance indicators in real time and exposes them immediately to the operations teams. Accurate real-time indicators, such as detected anomalies or less-than-optimal model predictions, can be handled by automated processes to mitigate losses and increase profit.
At the micro scale, timely tracking of customer behaviour combined with statistical models derived from massive data increases the ability to hit the opportunity window.
The proposed solution grows with the business. It is inherently scalable while retaining real-time intelligence.
The Accelerator combines the TIBCO portfolio with open source solutions. It demonstrates the ability to convert Fast Data to Big Data (or Data in Motion to Data at Rest).
The example retailer has a large number of stores and POS terminals. All traffic is handled by a centralized solution. The transaction information, with attached customer identification (loyalty card), is delivered using the Apache Kafka message bus. Kafka is a highly horizontally scalable messaging solution. Its advantage over TIBCO EMS is that it can be natively scaled out and in depending on the workload.
The messages are processed by an event processing layer implemented with TIBCO StreamBase CEP. The events are enriched with customer history kept in Apache HBase, another horizontally scalable solution. HBase is a column-oriented datastore with primary access by unique key. Its design allows constant-cost access to customer data with millisecond-range latency. In the proposed solution, HBase keeps the customer transaction history and provides lock-free, constant-cost access for both reading past transactions and appending new ones.
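The access pattern described here, keyed lookup of a customer's history followed by an append, can be sketched in a few lines of Python. This is only an in-memory stand-in meant to show the shape of the enrichment step; it does not use the HBase client API, and the class name, function name, and event fields (`customer_id`, `tx_id`, `items`) are assumptions made for illustration.

```python
from collections import defaultdict


class CustomerHistoryStore:
    """In-memory stand-in for a key-addressed history table (not HBase itself)."""

    def __init__(self):
        # One "row" per customer id; each row is an append-only list.
        self._rows = defaultdict(list)

    def append(self, customer_id, transaction):
        """Append a transaction to the customer's row."""
        self._rows[customer_id].append(transaction)

    def history(self, customer_id):
        """Return a copy of the customer's past transactions (empty if unknown)."""
        return list(self._rows.get(customer_id, []))


def enrich(event, store):
    """Attach the customer's past transactions to the event, then record the event."""
    past = store.history(event["customer_id"])
    enriched = dict(event, history=past)
    store.append(
        event["customer_id"],
        {"tx_id": event["tx_id"], "items": event["items"]},
    )
    return enriched
```

Each lookup and append touches only the row of a single customer, which is the property that lets a key-addressed store keep access cost flat as the number of customers grows.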
The StreamBase application also enriches transaction data by identifying the category of each purchased item. The categorized transactions are used to build a numerical customer profile, which is in turn processed by statistical models. As a result, the customer's propensity to accept each of the current offers is evaluated. The offer with the best score is sent back to the POS.
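As a rough illustration of this categorize, profile, score, and select flow, the Python sketch below uses per-category purchase counts as the numerical profile and a logistic model per offer. All of it is assumed for the example: the item catalogue, the category list, the offer names, and the coefficients are hypothetical, and the accelerator's actual models are trained with Spark and H2O rather than hard-coded.

```python
import math
from collections import Counter

# Hypothetical item-to-category lookup; the real catalogue is not shown in the text.
ITEM_CATEGORY = {"milk": "dairy", "cheese": "dairy", "bread": "bakery", "nappies": "baby"}
CATEGORIES = ["dairy", "bakery", "baby"]

# Hypothetical logistic models per offer: (intercept, one weight per category).
OFFER_MODELS = {
    "dairy_coupon": (-1.0, [0.8, 0.1, 0.0]),
    "baby_club": (-2.0, [0.0, 0.0, 1.5]),
}


def categorize(items):
    """Map purchased items to categories, skipping items missing from the lookup."""
    return [ITEM_CATEGORY[item] for item in items if item in ITEM_CATEGORY]


def build_profile(history):
    """Turn a list of transactions (each a list of items) into per-category counts."""
    counts = Counter(cat for items in history for cat in categorize(items))
    return [float(counts[cat]) for cat in CATEGORIES]


def propensity(profile, model):
    """Logistic score in (0, 1) for one offer."""
    intercept, weights = model
    z = intercept + sum(w * x for w, x in zip(weights, profile))
    return 1.0 / (1.0 + math.exp(-z))


def best_offer(profile):
    """Score every offer and return the name of the best one plus all scores."""
    scores = {name: propensity(profile, model) for name, model in OFFER_MODELS.items()}
    return max(scores, key=scores.get), scores
```

For example, `best_offer(build_profile([["milk", "bread"], ["nappies", "nappies", "cheese"]]))` selects `baby_club` under these made-up coefficients, because the two baby-category purchases outweigh the dairy signal.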
On the server side, the categorized transaction information is retained to track sales performance and react to changing trends.
The transaction information, together with the offer sent to the customer, is passed to the data analytics store in HDFS. HDFS is a key component of Hadoop. It is a distributed filesystem optimized to reliably store and process huge amounts of data. HDFS is a perfect tool for Big Data analytics, but it performs poorly when the data is still moving. In particular, reliable writes of small chunks of data to HDFS are too slow to be used directly in event processing applications.
To store data efficiently in HDFS, the accelerator uses a staging approach. First the data is sent to a Flume component. Flume aggregates the data in batches and stores them in the append-friendly Avro format.
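The idea behind staging, buffering many small records and writing them out as one larger batch, can be sketched as follows. This is not Flume's API: the `BatchStager` class and its `sink` callable are hypothetical, and the sketch flushes on record count only, whereas a real staging component would typically also flush on elapsed time or file size.

```python
class BatchStager:
    """Buffers small records and hands them to a sink in batches."""

    def __init__(self, sink, max_records=3):
        self._sink = sink            # callable that receives one list per batch
        self._max_records = max_records
        self._buffer = []

    def add(self, record):
        """Buffer one record; flush automatically once the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def flush(self):
        """Write out whatever is buffered as a single batch (no-op if empty)."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)
```

With `max_records=3`, seven calls to `add` produce two full batches, and a final `flush()` writes the remaining record, so the slow store sees three writes instead of seven.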
The data in the HDFS cluster is used for customer behaviour modelling. The TIBCO Spotfire component coordinates model training and deployment. The actual data access and transformation is performed by the Apache Spark component. Spotfire communicates with Spark to aggregate the data and to process it for model training. To improve data access, Spark is used in an ETL process to convert the Avro files to the analytics-friendly Parquet format. The models are built with Spark and H2O. H2O is an open source data analytics solution designed for distributed model processing.
With Spotfire, the data scientist orchestrates model creation and submits the best models for hot deployment. To do that, Spotfire communicates with the Zookeeper component, which keeps the runtime configuration settings and passes the changes to all active instances of StreamBase, closing the processing loop.
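The hot-deployment loop can be pictured as a shared configuration entry that every scoring instance watches. The Python sketch below is an in-memory stand-in rather than ZooKeeper client code; the `ConfigRegistry` and `ScoringInstance` classes, the `/accelerator/model` path, and the model identifiers are all assumptions made for illustration.

```python
class ConfigRegistry:
    """In-memory stand-in for a shared configuration service (not ZooKeeper itself)."""

    def __init__(self):
        self._values = {}
        self._watchers = {}

    def watch(self, path, callback):
        """Register a callback; deliver the current value at once if one exists."""
        self._watchers.setdefault(path, []).append(callback)
        if path in self._values:
            callback(self._values[path])

    def set(self, path, value):
        """Store a new value and notify every watcher of that path."""
        self._values[path] = value
        for callback in self._watchers.get(path, []):
            callback(value)


class ScoringInstance:
    """Stand-in for one event processing instance that swaps models at runtime."""

    def __init__(self, name, registry, path="/accelerator/model"):
        self.name = name
        self.active_model = None
        registry.watch(path, self._on_change)

    def _on_change(self, model_id):
        # Switch to the newly deployed model without restarting the instance.
        self.active_model = model_id
```

Calling `registry.set("/accelerator/model", "model-v2")` updates every registered instance in one step, and an instance started later picks up the current model when it registers its watch.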
The EventProcessor is a StreamBase application. This component is central to the event processing part of the Fast Data story. The application consumes transaction messages from a Kafka topic and applies regular event processing to them, showing several possible techniques:
- in-memory enrichment
- state delegation
- model execution
- process tracking
Real Time Dashboard
The demo operational dashboard is a Live DataMart instance. Its intended purpose is to keep an up-to-date ledger of recent data. The DataMart does not keep the whole event history.
At the moment it contains transaction data and transaction content, which are visible in the LiveView Web application. The user can drill down from transactions to transaction content.
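A ledger of recent data with drill-down can be sketched as two keyed tables with a retention limit. The Python below is only a conceptual model, not the Live DataMart API: the `RecentLedger` class is hypothetical, and count-based eviction is an assumption, since the text does not say how old rows are retired.

```python
from collections import OrderedDict


class RecentLedger:
    """Keeps only the most recent transactions plus their line items."""

    def __init__(self, max_transactions=1000):
        self._max = max_transactions
        self._transactions = OrderedDict()  # tx_id -> header, oldest first
        self._content = {}                  # tx_id -> list of line items

    def publish(self, tx_id, header, lines):
        """Add a transaction and evict the oldest ones beyond the limit."""
        self._transactions[tx_id] = header
        self._content[tx_id] = list(lines)
        while len(self._transactions) > self._max:
            old_id, _ = self._transactions.popitem(last=False)
            del self._content[old_id]

    def transactions(self):
        """Return the retained (tx_id, header) pairs, oldest first."""
        return list(self._transactions.items())

    def drill_down(self, tx_id):
        """Return the line items of one retained transaction (empty if evicted)."""
        return list(self._content.get(tx_id, []))
```

Evicting a transaction removes its content rows as well, so a drill-down never returns line items for a transaction that is no longer listed.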
The Big Data resides in HDFS. It is typically too large to load in a single process. The data service is a Spark application that exposes a REST/HTTP interface (and Spark SQL, too). The application provides the following services:
- Access to preregistered views (like ranges or available categories)
- ETL process converting Avro to Parquet
- Featurization result preview
- Model training execution (with Sparkling Water)
Spotfire is the main analytics and model lifecycle dashboard. It is used to:
- Inspect the collected data
- Prepare the models
- Review the trained models
- Build configurations
- Deploy models
The traffic simulator is a Jython (a Java implementation of Python) application that injects transactions to be consumed by the event processing layer. The application sends events to a Kafka topic.
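A traffic generator of this kind can be approximated with the short Python sketch below. The message layout (`tx_id`, `customer_id`, `lines`), the item and customer lists, and the JSON encoding are assumptions for illustration; the accelerator's actual message format is not described here. The `send` argument is a placeholder for whatever publishes to the Kafka topic, which keeps the sketch free of any client library.

```python
import json
import random


def make_transaction(rng, tx_id, customers, items, max_lines=3):
    """Build one random transaction for a random customer."""
    line_count = rng.randint(1, max_lines)
    lines = [
        {"item": rng.choice(items), "quantity": rng.randint(1, 3)}
        for _ in range(line_count)
    ]
    return {"tx_id": tx_id, "customer_id": rng.choice(customers), "lines": lines}


def simulate(send, count, seed=0):
    """Generate `count` transactions and pass each one, JSON-encoded, to `send`."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    customers = ["c%03d" % i for i in range(5)]
    items = ["milk", "bread", "cheese", "nappies"]
    for tx_id in range(count):
        send(json.dumps(make_transaction(rng, tx_id, customers, items)))
```

Passing `print` or a list's `append` method as `send` is enough to inspect the generated traffic before wiring the generator to a real producer.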
Open Source Software Components:
- Apache Hadoop
- Apache Kafka
- Apache HBase
- Apache Zookeeper
- Apache Spark
- H2O, Sparkling Water
See the QuickStart guide attached to this page.