Anomaly Detection with Machine Learning
What are Anomalies?
Anomaly detection is a way of detecting abnormal behavior. One definition of anomalies is "data points which do not conform to an expected pattern of the other items in the data set. Anomalies are referred to as a different distribution that occurs within a dataset. Anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains." The figure below shows a simple example of anomalies (o1, o2, O3) in a 2D dataset. The autoencoder technique described here first uses machine learning models to specify expected behavior and then monitors new data to match and highlight unexpected behavior.
(Anomalies are similar, but not identical, to outliers. Outliers are points with a low probability of occurrence within a given data set. They are observation points that are distant from other observations. However, they don't necessarily represent abnormal behavior. Outliers in data warrant attention because they can distort predictions and affect model accuracy if you don’t detect and handle them. For more information on detecting outliers in Spotfire, see this Wiki article: Top 10 methods for Outlier Detection.)
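As a rough illustration of "distant from other observations", the sketch below flags outliers in a single column with a plain z-score rule. The function name, the sample readings and the 2.5 cutoff are assumptions made for this example. This is not one of the methods from the linked Wiki article.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Illustrative sensor readings; the last value is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0]

# In a sample of only 10 points a z-score cannot exceed about 2.85,
# so the common cutoff of 3 would never fire here; 2.5 is used instead.
print(zscore_outliers(readings, threshold=2.5))
```

Whether a point flagged this way is also an anomaly still depends on context, which is the distinction the paragraph above draws.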
Use Cases for Anomaly Detection
Fighting Financial Crime – In the financial world, trillions of dollars’ worth of transactions happen every minute. Identifying suspicious ones in real time can provide organizations the necessary competitive edge in the market. Over the last few years, leading financial companies have increasingly adopted big data analytics to identify abnormal transactions, clients, suppliers, or other players. Machine Learning models are used extensively to make predictions that are more accurate.
Real-time Fraud Detection Accelerator
Monitoring Equipment Sensors – Many different types of equipment, vehicles and machines now have sensors. Monitoring these sensor outputs can be crucial to detecting and preventing breakdowns and disruptions. Unsupervised learning algorithms like autoencoders are widely used to detect anomalous data patterns that may predict impending problems.
Healthcare claims fraud – Insurance fraud is a common occurrence in the healthcare industry. It is vital for insurance companies to identify claims that are fraudulent and ensure that no payout is made for those claims. The Economist recently published an article that estimated $98 billion as the cost of insurance fraud and the expenses involved in fighting it. This amount would account for around 10% of annual Medicare & Medicaid spending. In the past few years, many companies have invested heavily in big data analytics to build supervised, unsupervised and semi-supervised models to predict insurance fraud.
TIBCO Cloud Risk Investigation App
Manufacturing defects – Autoencoders are also used in manufacturing for finding defects. Manual inspection to find anomalies is a laborious, offline process, and building machine-learning models for each part of the system is difficult. Therefore, some companies have implemented an autoencoder-based process in which sensor data on manufactured components is continuously fed into a database and any defects (i.e. anomalies) are detected by the autoencoder model that scores the new data. Example
Techniques for Anomaly Detection
Companies around the world have used many different techniques to fight fraud in their markets. While the list below is not comprehensive, three anomaly detection techniques have been popular:
Visual Discovery - Anomaly detection can be accomplished through visual discovery. In this process, a team of data analysts, business analysts, etc. builds bar charts, scatter plots and similar visualizations to find unexpected behavior in their business. This technique often requires prior business knowledge of the industry of operation and a lot of creative thinking to use the right visualizations to find the answers.
Supervised Learning - Supervised learning is an improvement over visual discovery. In this technique, persons with business knowledge in the particular industry label a set of data points as normal or anomalous. An analyst then uses this labelled data to build machine learning models that will be able to predict anomalies on unlabeled new data.
Unsupervised Learning - Another technique that is very effective but is not as popular is unsupervised learning. In this technique, unlabeled data is used to build unsupervised machine learning models. These models are then used to score new data. Since the model is tailored to fit normal data, the small number of data points that are anomalies stand out.
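The unsupervised workflow described above (fit a model on unlabeled data, then score new data against it) can be sketched with a small k-means model written in plain NumPy. This is a minimal sketch under stated assumptions, not the article's implementation. The synthetic data, the helper names `fit_kmeans` and `anomaly_score`, the farthest-first initialisation and the 99th-percentile cutoff are all choices made for the example.

```python
import numpy as np

def fit_kmeans(X, k, n_iter=20):
    """Plain Lloyd's algorithm with farthest-first initialisation."""
    centroids = [X[0]]
    for _ in range(1, k):
        # Distance from every row to its closest centroid chosen so far.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def anomaly_score(X, centroids):
    """Distance from each row to its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

# Unlabeled "normal" data: two tight groups (synthetic, for illustration).
rng = np.random.default_rng(42)
normal = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([5.0, 5.0], 0.5, size=(200, 2)),
])

centroids = fit_kmeans(normal, k=2)
# Assumed rule: flag anything farther out than 99% of the training rows.
threshold = np.percentile(anomaly_score(normal, centroids), 99)

new_points = np.array([[0.2, -0.1], [5.1, 4.8], [2.5, 2.5]])
scores = anomaly_score(new_points, centroids)
flags = scores > threshold
print(list(zip(scores.round(2), flags)))
```

The third point sits between the two groups, so its distance to the nearest centroid is large and it stands out, while the first two points score like ordinary training data.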
Some examples of unsupervised learning algorithms are:
Autoencoders – Unsupervised neural networks, or autoencoders, are trained to replicate the input dataset at the output while the number of hidden units in the network is restricted. Scoring a data point with the model produces a reconstruction error. The higher the reconstruction error, the higher the possibility that the data point is an anomaly.
Clustering – In this technique, the analyst groups the data points into a chosen number of clusters by minimizing the within-cluster variance, typically with a model such as K-means. Points that lie far from every cluster centroid, or that fall into very small clusters, do not look similar to normal data and can be flagged as anomalies. Distance-based methods such as K-nearest neighbors (KNN) serve a similar purpose by scoring each point on its distance to its nearest neighbors.
One-class support vector machine – A standard support vector machine finds the hyperplane that best divides a set of labelled data into two classes by maximizing the margin, that is, the distance between the hyperplane and the nearest data points on either side. For anomaly detection, a one-class support vector machine is instead trained on unlabelled, mostly normal data to learn a boundary around it, and data points that fall outside that boundary are considered anomalies.
Time Series techniques – Anomalies can also be detected through time series analytics by building models that capture trend, seasonality and levels in time series data. These models are then used along with new data to find anomalies. Industry example
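For the time series item above, the sketch below uses a trailing-window mean and standard deviation as a deliberately simple stand-in for a model that captures trend, seasonality and level. The function name, window size, three-sigma rule and sample series are assumptions for illustration only.

```python
from statistics import mean, stdev

def rolling_anomalies(series, window=5, n_sigmas=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` points by more than `n_sigmas` of their standard deviation."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) > n_sigmas * sigma:
            flagged.append(i)
    return flagged

# Illustrative readings with one spike at index 8.
series = [20.0, 20.5, 19.8, 20.2, 20.1, 20.4,
          19.9, 20.3, 35.0, 20.2, 20.0, 20.1]
print(rolling_anomalies(series))
```

Note that the spike inflates the trailing standard deviation for the next few points, so a baseline this simple is briefly less sensitive right after an anomaly. A model that explicitly captures trend and seasonality would not have that blind spot to the same degree.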
Autoencoders are unsupervised neural networks that are both similar to and different from a traditional feed-forward neural network. They are similar in that they use the same principles (i.e. backpropagation) to build a model. They are different in that they do not use a labelled dataset containing a target variable for building the model. Instead, an autoencoder uses the training dataset as both input and target and attempts to replicate the input at the output while the hidden layers/nodes are restricted.
The focus of this model is to learn an identity function, or an approximation of it, that allows it to predict an output that is similar to the input. The model is kept from simply copying its input by placing restrictions on the number of hidden units in the network. For example, if we have 10 columns in a dataset (L1 in above diagram) and only five hidden units (L2 above), the neural network is forced to learn a compressed representation of the input. By limiting the hidden units, we can force the model to learn a pattern in the data if there indeed exists one.
Alternatively, leaving the number of hidden units unrestricted and instead imposing a ‘sparsity’ constraint on the neural network can also reveal interesting structure in the data.
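The article does not say which form of sparsity constraint is meant. One common choice, shown here only as an assumption, adds an L1 penalty on the hidden activations to the reconstruction error:

```latex
\mathcal{L}(x) \;=\; \lVert x - \hat{x} \rVert_2^2 \;+\; \lambda \sum_{j=1}^{m} \lvert h_j \rvert
```

Here x is an input row, \hat{x} is its reconstruction, h_j is the activation of hidden unit j, m is the number of hidden units and \lambda \ge 0 controls how strongly inactive units are favored. All of these symbols are introduced for this example. With \lambda = 0 the expression reduces to the plain reconstruction error.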
Each of the hidden units can be either active or inactive and an activation function such as ‘tanh’ or ‘Rectifier’ can be applied to the input at these hidden units to change their state.
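The description above (10 input columns, five hidden units, a tanh activation, training by backpropagation) can be sketched in plain NumPy as follows. This is a minimal sketch, not the H2O or TensorFlow models used in the demos further below. The synthetic data, learning rate, epoch count, 99th-percentile threshold and the hand-made anomalous row are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 10 correlated columns driven by 3 hidden factors.
n_rows, n_cols, n_factors, n_hidden = 500, 10, 3, 5
mixing = rng.normal(0.0, 0.5 / np.sqrt(n_factors), size=(n_factors, n_cols))
X = rng.normal(size=(n_rows, n_factors)) @ mixing
X += rng.normal(0.0, 0.05, size=X.shape)  # small measurement noise

# Encoder (10 -> 5, tanh) and decoder (5 -> 10, linear) parameters.
W1 = rng.normal(0.0, 0.1, size=(n_cols, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_cols))
b2 = np.zeros(n_cols)

def reconstruct(data):
    """Return hidden activations and the reconstructed rows."""
    hidden = np.tanh(data @ W1 + b1)
    return hidden, hidden @ W2 + b2

def reconstruction_error(data):
    """Mean squared reconstruction error of each row."""
    _, output = reconstruct(data)
    return ((data - output) ** 2).mean(axis=1)

learning_rate, losses = 0.05, []
for epoch in range(3000):
    hidden, output = reconstruct(X)
    diff = output - X
    losses.append((diff ** 2).mean())
    # Backpropagation of 0.5 * squared error, averaged over rows.
    grad_out = diff / n_rows
    grad_W2 = hidden.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_pre = (grad_out @ W2.T) * (1.0 - hidden ** 2)  # tanh derivative
    grad_W1 = X.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0)
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

# Assumed rule: flag rows whose error exceeds 99% of the training errors.
train_errors = reconstruction_error(X)
threshold = np.percentile(train_errors, 99)

# A hand-made row whose columns ignore the correlations in the training data.
anomaly = np.array([[1.5, -1.5] * (n_cols // 2)])
anomaly_error = reconstruction_error(anomaly)[0]

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"threshold={threshold:.4f}  anomaly error={anomaly_error:.4f}")
print("flagged as anomaly:", anomaly_error > threshold)
```

Because only five hidden units sit between the 10 inputs and the 10 outputs, the network can reproduce rows that follow the pattern it was trained on but not rows that break it, which is why the reconstruction error works as an anomaly score.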
Some forms of autoencoders, and topics related to them, are as follows:
- Undercomplete autoencoders
- Regularized autoencoders
- Representational Power, Layer Size and Depth
- Stochastic Encoders and Decoders
- Denoising autoencoders
A detailed explanation of each of these types of auto encoders is available here.
TIBCO Solutions for Anomaly Detection
Spotfire Template using H2O R package
TIBCO Spotfire’s Anomaly detection template uses an autoencoder trained in H2O for best-in-the-market training performance. It can be configured with document properties on Spotfire pages and used through a point-and-click interface.
Download the template from the Component Exchange. See the documentation in the download distribution for details on how to use this template.
Time Series Analysis
Using AI to detect complex anomalies in time series data
Here is a presentation on recent work using Deep Learning Autoencoders for Anomaly Detection in Manufacturing. The Spotfire Template for Anomaly Detection is used in this presentation.
In a dynamic manufacturing environment, it may not be adequate to look only for known process problems. It is also important to uncover and react to new, previously unseen patterns and problems as they emerge. Univariate and linear multivariate Statistical Process Control methods have traditionally been used in manufacturing to detect anomalies. With increasing equipment, process and product complexity, multivariate anomalies that also involve significant interactions and nonlinearities may be missed by these more traditional methods.
The presentation describes a method for identifying complex anomalies using a deep learning autoencoder. Once the anomalies are detected, their fingerprints are generated so they can be classified and clustered, enabling investigation of the causes of the clusters. As new data streams in, it can be scored in real time to identify new anomalies, assign them to clusters and respond to mitigate potential problems.
These tools are no longer the exclusive province of data scientists. After an initial configuration, the method shown can be routinely employed by engineers who do not have deep expertise in data science. Click on the image below to watch a video of the presentation:
Anomalies and their component signatures in a time series dataset
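The text does not define how a fingerprint (or component signature) is computed. One plausible reading, shown here purely as an assumption, is the share of the total squared reconstruction error contributed by each column. The helper name, column names and numbers below are hypothetical.

```python
import numpy as np

def error_fingerprint(x, x_hat):
    """Fraction of the squared reconstruction error owed to each column.
    Assumed definition; the presentation may compute this differently."""
    squared = (np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)) ** 2
    total = squared.sum()
    return squared / total if total > 0 else np.zeros_like(squared)

# Hypothetical sensor columns. In practice the columns would be scaled
# to comparable units before errors are compared across them.
columns = ["temperature", "pressure", "vibration", "flow"]
observed = np.array([70.2, 1.01, 0.90, 12.1])
reconstructed = np.array([70.0, 1.00, 0.30, 12.0])

fingerprint = error_fingerprint(observed, reconstructed)
top = columns[int(fingerprint.argmax())]
print(dict(zip(columns, fingerprint.round(3))), "-> dominant:", top)
```

Vectors of this kind have the same length for every anomaly, so they could be passed to a clustering step to group anomalies that share a similar cause.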
Here are the slides used in the presentation:
Watch a demo showing how to use the Spotfire Time Series Anomaly Detection template
Click on the image below to see a demo of the Autoencoder deployed to our Hi Tech Manufacturing Accelerator for real-time monitoring:
Autoencoder Model deployed for real-time monitoring
Demo using Spotfire X's Python data function extension and TensorFlow
TIBCO Spotfire’s Python data function enables users to use all packages available on PyPI to build custom functionality into their dashboards. Users can utilize document properties and data functions to execute custom Python code and use the results of the execution to update visualizations on a Spotfire dashboard.
In this demo, we use the TensorFlow Python package to build an unsupervised neural network (a.k.a. an autoencoder) to detect anomalies in manufacturing data. More information on the demo and on access to the assets is available here.
Demo using TIBCO Data Science and AWS SageMaker for Distributed TensorFlow
TIBCO products can interact with data in the cloud and build any type of neural network using TensorFlow. Specifically, TIBCO Data Science working with cloud resources like AWS allows users to build unsupervised neural networks for anomaly detection on data of any size.
In this example, we use AWS products (S3, EMR, Redshift and SageMaker) to build an autoencoder using multiple nodes in a cluster. A video presentation of the demo is available here.
Here are the slides
Do anomaly detection on AWS:
Autoencoders – Deep Learning book
Fraud Wiki page