Anomaly Detection Template for TIBCO Spotfire®
This template detects anomalous data points in a dataset using an autoencoder algorithm. It features automated machine learning to facilitate use by business analysts and citizen data scientists. The Time Series release of the template includes time series analysis and clustering of anomalies.
This analysis has been tested with Spotfire 7.10, TERR 4.4.0, and the CRAN packages data.table (version 1.10.4) and h2o.
TIBCO Component Exchange License
Anomaly detection is a way of detecting abnormal behavior. This template uses an autoencoder machine learning model to specify expected behavior and then monitors new data to match and highlight unexpected behavior. Version 2 features automated machine learning to optimize model tuning parameters. The Time Series release includes time series analysis, so it can be used as a form of 'control chart', as well as input component drill-down to find the most important features influencing a reconstruction error, and clustering analysis to group and analyze similar groups of anomalies.
A video explaining the concepts is available here: https://www.youtube.com/watch?v=Ebdp5Ao1o9o
Spotfire also extends this use case with advanced statistical modeling to build supervised and unsupervised models.
Anomaly Detection using TensorFlow with TIBCO Spotfire
TIBCO Spotfire’s Python Data Function enables users to install and readily use packages available on PyPI to build custom functionality into their dashboards. Users can execute custom Python code and use the resulting document properties or tables to update visualizations on a Spotfire dashboard. We present an overview of the autoencoder implementation and its uses in Autoencoder TensorFlow Python Data Function for TIBCO Spotfire® and the Anomaly Detection Template for TIBCO Spotfire®.
The following requirements must be met to enable running the Autoencoder data functions:
- Spotfire 10.7 (or later) client and server
- Python packages pandas, numpy, scipy, scikit-learn, and tensorflow must be installed for the Python data function to work. Both assets use TensorFlow version 2.5.0.
The dataset used in both assets contains manufacturing equipment data captured over a few weeks across five plant locations with three different products (disclaimer: the data is fictitious and has been created for the purpose of the demo). Various metrics from this period can help us identify abnormal behavior in our machines. Overall, the autoencoder and subsequent analyses can be used in real-time applications to proactively identify risks and mitigate them.
TensorFlow is an open-source software library used in the industry today for machine learning and deep learning. Keras is a deep learning Python API/interface for TensorFlow. More information on TensorFlow is available here and more information on Keras is available here.
TensorFlow has a rich ecosystem of APIs in many programming languages, ancillary products for serving models, visualization frameworks (e.g., TensorBoard), and deployment packages for edge and hosted products.
Unsupervised feed-forward neural networks, also known as autoencoders, are an important deep learning technique that is used for a variety of use cases, including anomaly detection, multivariate regression, and dimension reduction.
Anomaly detection is a way of detecting abnormal behavior. This technique uses past data to learn a pattern of expected behavior. This pattern is compared across new and real-time events to highlight any abnormal or unexplained activity at a specific moment.
Some use cases for anomaly detection include:
- Monitoring sensors on edge devices
- Financial or healthcare fraud
- Manufacturing equipment early-failure detection
Autoencoders are similar to normal feed-forward neural networks in that they can have multiple layers of neurons that attempt to learn a pattern in the dataset. However, unlike traditional feed-forward networks, autoencoders do not require a target (i.e., a dependent column). Instead, autoencoders have a set of layers for encoding the dataset and then replicate these layers in reverse order for decoding the encoded dataset. The output from the final decoded layer is the reconstructed data, and reconstruction error refers to the difference between the original data and the reconstructed data. Data points that lie within the expected pattern space have low reconstruction error, while data points with abnormal behavior (falling away from the expected pattern space) have higher reconstruction error. Assessing the reconstruction errors helps us identify the data points that might be anomalous or require further examination.
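The idea above can be sketched in a few lines of TensorFlow/Keras. This is a minimal illustration, not the template's actual data function: the layer sizes, activations, and random stand-in data are all made up for the example.

```python
# Minimal autoencoder sketch, assuming TensorFlow 2.x / Keras.
# Layer sizes, activations, and data are illustrative only.
import numpy as np
from tensorflow.keras import layers, models

n_features = 10
autoencoder = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(8, activation="tanh"),   # encoder
    layers.Dense(3, activation="tanh"),   # bottleneck
    layers.Dense(8, activation="tanh"),   # decoder
    layers.Dense(n_features),             # reconstructed output
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train with the input as its own target -- no dependent column needed
X = np.random.rand(200, n_features).astype("float32")
autoencoder.fit(X, X, epochs=2, batch_size=32, verbose=0)

reconstructed = autoencoder.predict(X, verbose=0)
# Per-row reconstruction MSE; unusually large values flag potential anomalies
reconstruction_mse = np.mean((X - reconstructed) ** 2, axis=1)
```

Note that the model is fit with `X` as both input and target, which is what makes the network unsupervised.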
More information on autoencoders and their variants is available here.
Python Data Function
The TIBCO Exchange component for Autoencoder TensorFlow Python Data Function for TIBCO Spotfire® includes an .sfd file (the exported data function) and a Spotfire analysis file (DXP).
The ‘Overview’ page in the DXP goes over the required libraries, library and parameter documentation, and provides tips for running the function. To adjust the parameters or configure a different dataset, edit the ‘[Modeling] Autoencoder TensorFlow’ data function (Data -> Data function properties -> Edit Parameters). Change the Input Data to the desired data table and include the predictor columns (and optionally ID or Data Usage columns).
The ‘Build and Evaluate Model’ page in the DXP provides a UI on the left to tune the neural network parameters (although a user can also run with the predefined parameters). The two charts, ‘Histogram of Reconstruction Errors’ and ‘Loss per Epoch’ help us assess model training. See this blog for more information on how to build a good autoencoder model that will generalize to new datasets using these visuals.
The ‘Postprocessing’ page in the DXP analyzes the reconstruction errors from the autoencoder and provides an explainability component to the model. ‘Reconstruction Mean Squared Error over Time’ shows the reconstruction MSE for each data point. We want to study points with high reconstruction MSE. If we mark some of these points, ‘Top Features contributing to Reconstruction Error’ updates to show the top predictor columns contributing to this high error. When we select one feature, the trellised visual on the right updates to compare across time: the overall reconstruction MSE, the reconstruction MSE for the selected feature, and the original data for that feature. The idea is to assess whether higher reconstruction errors correspond with abnormal data points in the selected feature/dimension.
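The explainability idea described above can be sketched with plain NumPy: decompose each row's reconstruction MSE into per-feature squared errors, then rank features by their contribution. The arrays and values here are illustrative, not the template's code.

```python
# Per-feature contribution to reconstruction error (illustrative values).
import numpy as np

original = np.array([[1.0, 2.0, 3.0],
                     [1.0, 2.0, 9.0]])   # second row anomalous in feature 3
reconstructed = np.array([[1.1, 2.0, 3.0],
                          [1.0, 2.1, 3.2]])

sq_err = (original - reconstructed) ** 2   # per-cell squared error
row_mse = sq_err.mean(axis=1)              # overall reconstruction MSE per row
top_feature = sq_err.argmax(axis=1)        # biggest contributor per row
```

For the anomalous second row, the third feature dominates the error, which is exactly the kind of drill-down the ‘Top Features contributing to Reconstruction Error’ visual provides.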
Autoencoder Implementation using TensorFlow
The autoencoder data function is meant for both beginner and advanced users. As seen in the input parameters (Data Function -> Edit Script), the only required parameter is the input data; the rest of the parameters are optional. Some of them are related to data preparation logistics (file_path for saving the model, id_column for attaching a unique identifier to the data, and data_usage_column so the user can specify their own train/test/validation splits), while the majority of parameters are related to the neural network. These neural network parameters have standard defaults: Huber loss, Adam optimizer, tanh activation, etc. Take a look at the readme documentation attached in the Exchange release or the data function parameter descriptions (pictured below) for more information.
We want to highlight a snippet of the TensorFlow code for creating autoencoder architectures. The default architecture is: dimensions of model data [original input] -> 200 neurons [encoder] -> 50 neurons [bottleneck] -> 200 neurons [decoder] -> dimensions of model data [reconstructed output]. Note, the term bottleneck refers to the compressed middle, hidden layer (often the smallest layer). If the user wants to specify their own architecture, they can give the encoder hidden layer sizes plus the bottleneck size as a comma-separated list. For example, on the ‘Build and Evaluate Model’ page, the given list ‘64, 32, 5’ tells us that the encoder sizes are 64 -> 32, the bottleneck size is 5, and then we create the decoder sizes 32 -> 64. The overall architecture is then: dimensions of model data [original input] -> 64 neurons [encoder] -> 32 neurons [encoder] -> 5 neurons [bottleneck] -> 32 neurons [decoder] -> 64 neurons [decoder] -> dimensions of model data [reconstructed output]. Dropout layers can be optionally added after each hidden layer. Line 291 can also be uncommented to enforce that the encoder and bottleneck sizes are strictly decreasing, as often seen in standard autoencoders.
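The mirroring logic described above can be sketched as a small helper. The function name and structure here are illustrative, not the data function's actual implementation:

```python
# Sketch: turn a comma-separated hidden-layer list such as "64, 32, 5"
# into the mirrored encoder -> bottleneck -> decoder layer sizes.
# Function name is hypothetical, for illustration only.
def mirror_architecture(hidden_layers_csv, n_features):
    sizes = [int(s) for s in hidden_layers_csv.split(",")]
    encoder, bottleneck = sizes[:-1], sizes[-1]
    decoder = list(reversed(encoder))  # decoder mirrors the encoder
    return [n_features] + encoder + [bottleneck] + decoder + [n_features]

layer_sizes = mirror_architecture("64, 32, 5", 20)
# -> [20, 64, 32, 5, 32, 64, 20]
```

For strictly decreasing encoder sizes (as the uncommentable check in the data function enforces), one could additionally assert `sizes == sorted(sizes, reverse=True)`.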
Lastly, the autoencoder has multiple purposes. We save the bottleneck as a Numpy array; this is often a reduced dimension or representation of the data learned. On the ‘Find Golden Batch’ page, we use quantile cutoffs on the reconstruction errors to filter for a ‘golden batch’ (the ‘best data’) under nominal conditions.
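The 'golden batch' filtering can be sketched as a quantile cutoff on the reconstruction errors: keep only the rows whose error falls below a chosen quantile. The error values and quantile here are illustrative.

```python
# Sketch: filter for a 'golden batch' via a quantile cutoff on
# reconstruction errors. Values and quantile are illustrative.
import numpy as np

errors = np.array([0.01, 0.02, 0.015, 0.5, 0.012, 0.9])
cutoff = np.quantile(errors, 0.75)       # user-chosen quantile cutoff
golden_mask = errors <= cutoff           # rows under nominal conditions
golden_indices = np.where(golden_mask)[0]
# the two large errors (0.5 and 0.9) are excluded
```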
Anomaly Detection Template
The Anomaly Detection Template for TIBCO Spotfire® is a full-scale data preparation, autoencoder and K-means modeling, and in-depth postprocessing analysis on the same dataset and can be used on any Time Series anomaly detection use case. It uses a variant of the same data function in Autoencoder TensorFlow Python Data Function for TIBCO Spotfire®, and includes other R data functions. For full instructions on using this template, reference the user guide within the Exchange release or the Dr. Spotfire video at the top of this wiki page.
The ‘Explore’ and ‘Model’ pages explore the columns of the user’s input data table, calculate summary statistics for these columns, facilitate choosing data to use in modeling, split the time series data into train/test/validation sets, and set predictor and variable types.
On the ‘Model’ and ‘Results’ pages, there are visuals that assess model training and postprocessing on the reconstruction errors similar to the ones in the python data function DXP. Again, these visuals are specific to autoencoder training and model evaluation.
Lastly, this template postprocesses the reconstruction errors using outlier cutoffs and K-means clustering to identify incidents and clusters of incidents over time. An incident is defined as a collection of at least 5 consecutive timestamps with reconstruction errors over a user-defined outlier cutoff. Incidents can be clustered together into similar groups and analyzed over time. This can be done either retrospectively or in conjunction with new (real-time) data.
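The incident definition above can be sketched as a scan for runs of at-least-5 consecutive over-cutoff timestamps. The function, data, and cutoff are illustrative, not the template's actual R code:

```python
# Sketch: an incident is a run of >= min_length consecutive timestamps
# whose reconstruction error exceeds the outlier cutoff.
def find_incidents(errors, cutoff, min_length=5):
    incidents, start = [], None
    for i, e in enumerate(errors):
        if e > cutoff:
            if start is None:
                start = i            # run of outliers begins
        else:
            if start is not None and i - start >= min_length:
                incidents.append((start, i - 1))
            start = None             # run ends
    if start is not None and len(errors) - start >= min_length:
        incidents.append((start, len(errors) - 1))
    return incidents

errors = [0.1] * 3 + [0.9] * 6 + [0.1] * 2 + [0.9] * 3
find_incidents(errors, cutoff=0.5)
# -> [(3, 8)]  (the trailing run of 3 is too short to count)
```

Each `(start, end)` pair could then be summarized (duration, peak error, contributing features) and fed to K-means to group similar incidents.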