Cost Sensitive Classification in TIBCO’s Risk Management Accelerator
TIBCO’s Risk Management Accelerator is a ready-made template for predicting and assessing risk in any business or analytics context. The concept of risk appears in many verticals, ranging from the risk of damaged products or machine failure in transportation and manufacturing to risks to patient wellbeing in healthcare. Using TIBCO Spotfire and TIBCO Streaming, this accelerator builds an end-to-end system that analyzes and detects anomalies with machine learning models. We’ve now added a key feature to the analysis: cost sensitive classification.
About Cost Sensitive Classification
We focus on class-based costs (e.g., the cost of identifying a risky event is __ and the cost of missing a risky event is __) as opposed to costs defined by specific data attributes (e.g., the cost of this data point is ___). Class-based costs depend only on the true class and the predicted class, so they can be summarized as a cost matrix. N.B.: Elkan (2001) warns about how easy it is to make errors when specifying the cost matrix. He recommends conceptualizing outcomes in terms of benefits rather than costs to help avoid logical contradictions in the cost matrix.
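As a minimal sketch of how a class-based cost matrix turns a predicted probability into a decision, consider the Python snippet below. The cost values and function names are hypothetical and chosen only for illustration; they are not taken from the accelerator.

```python
# Hypothetical class-based cost matrix: COST[true_class][predicted_class].
# Classes: 0 = not risky, 1 = risky. The numbers are illustrative only.
COST = {
    0: {0: 0.0, 1: 1.0},    # true 0: correct = 0, false positive = 1
    1: {0: 10.0, 1: 0.0},   # true 1: false negative = 10, correct = 0
}


def expected_cost(p_risky, predicted_class, cost=COST):
    """Expected cost of predicting `predicted_class` when P(true class = 1) is p_risky."""
    return (1.0 - p_risky) * cost[0][predicted_class] + p_risky * cost[1][predicted_class]


def min_cost_decision(p_risky, cost=COST):
    """Pick the class with the lower expected cost (ties go to class 0)."""
    if expected_cost(p_risky, 1, cost) < expected_cost(p_risky, 0, cost):
        return 1
    return 0


def break_even_probability(cost=COST):
    """Probability at which predicting 0 and predicting 1 cost the same on average."""
    fp_penalty = cost[0][1] - cost[0][0]  # extra cost of a false positive
    fn_penalty = cost[1][0] - cost[1][1]  # extra cost of a false negative
    return fp_penalty / (fp_penalty + fn_penalty)
```

With these illustrative costs, the break-even probability is 1/11, so a case would be flagged as risky well below the usual 0.5 cutoff because missing a risky event is ten times more expensive than a false alarm.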
Visualize and Understand the Data
The default data in the Spotfire template is a risk dataset of credit card transactions, with transaction information for a range of customers. Credit card companies must be able to quickly detect and address unauthorized transactions to retain satisfied customers. The column ‘target’ indicates whether each transaction is fraud (labeled ‘1’) or not fraud (labeled ‘0’). About 6% of the data, or 3700 cases, are reported to be fraud, and 59000 cases are not fraud.
We can view how our data fields associate with the target variable, and how they associate amongst themselves. From the map and categorical distribution chart, we can tell that the majority of transactions in our dataset are in the United States. A numeric variable like the normalized age of an account plotted against our target tells us that fraudulent transactions are mainly distributed in younger accounts compared to non-fraudulent transactions.
We can view how categorical variables are correlated with one another and how numeric variables are correlated with our target. Here, account24hours is the variable most positively correlated with our target. This variable is the number of times a customer’s account was accessed in the last 24 hours; as that number increases, so does the occurrence of fraudulent transactions. These explorations help us decide which variables to include in our models.
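To make the idea of ranking numeric variables by their correlation with the target concrete, here is a small Python sketch using the Pearson correlation. The column values are made up for illustration and the helper names are hypothetical; Spotfire computes these views for you in the template.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_target_correlation(columns, target):
    """Sort column names from most positively to most negatively correlated with target."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data (not from the accelerator's dataset).
target = [0, 0, 0, 1, 1, 1]
columns = {
    "account24hours": [1, 2, 1, 8, 9, 7],
    "normalized_age": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
}
ranking = rank_by_target_correlation(columns, target)
```

In this toy data, `account24hours` lands at the top of the ranking with a positive correlation, while `normalized_age` is negatively correlated with the target.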
Create and Evaluate a Supervised Learning Model
We use random forests, a supervised learning technique, to find patterns in nonlinear data and to help handle unbalanced and high-dimensional data. Here we have chosen to use all of our data fields as predictors of whether target = 1 (i.e., whether a transaction is fraudulent). We can view the top predictors from the model and, in the partial dependence plot, the corresponding breakdown of how each predictor’s distribution contributes to the probability that a transaction is a target.
Partial dependence plots show the marginal effect that one or more features have on a model’s predictions. We used the R pdp library to calculate a single feature’s effect on the predicted probabilities from the random forest model. We also suggest considering accumulated local effects plots in place of partial dependence plots. Partial dependence plots can be skewed when features are correlated; accumulated local effects plots are faster and unbiased.
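The calculation behind a single-feature partial dependence curve can be sketched in a few lines of Python. This is an illustration of the general idea, not the R pdp implementation used in the accelerator; `toy_predict` is a hypothetical stand-in for a fitted model and the rows are made up.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to that value in every row and average the predictions."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_predict(row):
    """Hypothetical stand-in for a model's predicted probability of fraud."""
    score = 0.05 + 0.1 * row["account24hours"] - 0.2 * row["normalized_age"]
    return min(1.0, max(0.0, score))  # clip to a valid probability


# Made-up rows for illustration only.
rows = [
    {"account24hours": 1, "normalized_age": 0.8},
    {"account24hours": 3, "normalized_age": 0.4},
    {"account24hours": 6, "normalized_age": 0.1},
]
curve = partial_dependence(toy_predict, rows, "account24hours", grid=[0, 2, 4, 6])
```

Each point on `curve` pairs a grid value with the average predicted probability when every row is forced to that value. Because the other features keep their observed values, strongly correlated features can produce unrealistic combinations, which is the weakness noted above.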
To assess the supervised model results, we look at the ROC plot and confusion matrix. The confusion matrix is computed on the 20% of the data that we hold out for testing, using a heuristic cutoff. Since we are working with a binary classification problem (predicting whether the target is 0 or 1), the random forest outputs a probability between 0 and 1, and by default anything above a 0.5 threshold is categorized as fraudulent and anything below it as not fraudulent. We could use 0.5 as the cutoff, but in practice that may not be helpful. What if we turn customers away by investigating too many low-risk events or, at the other extreme, neglect too many high-risk events? This is why we incorporated cost sensitive classification and offer different options for configuring the models that will eventually be put into production.
We will reevaluate the deterministic 0.5 threshold. There are a few different options for choosing the threshold that decides whether or not a new case is fraudulent. You can choose the threshold that maximizes F1; an F1 score closer to 1 means we have a more accurate model with few false positives and few false negatives. In the Optimal Threshold Chart, you can choose a metric on the Y-axis and see how it varies with the predicted probability threshold. In the Cost Function chart, you can define your own costs for a false positive (incorrectly labeling a case as fraud) and for a false negative (missing a fraud).
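As a rough sketch of the maximum-F1 option, the Python snippet below scans candidate thresholds and keeps the one with the highest F1 score. The labels and probabilities are made up, the function names are hypothetical, and we assume a case is flagged when its probability is at or above the threshold.

```python
def confusion_counts(y_true, probs, threshold):
    """Return (tp, fp, fn, tn), flagging a case as fraud when prob >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(y_true, probs, threshold):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when nothing is flagged or present."""
    tp, fp, fn, _ = confusion_counts(y_true, probs, threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, probs, candidates):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_score(y_true, probs, t))


# Made-up labels and predicted probabilities for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.70, 0.90]
candidates = [i / 10 for i in range(1, 10)]
best = best_f1_threshold(y_true, probs, candidates)
```

On this toy data the best F1 is reached at a threshold of 0.3 rather than 0.5, because two of the four fraud cases score below 0.5.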
These can be literal dollar costs or relative costs. For example, we can set the cost of incorrectly labeling a transaction as fraud lower, at $1, and the cost of missing a fraud higher, at $10 (10 times more expensive). We can adjust these numbers as needed and either take the minimum-cost cutoff or choose the point on the chart where the cost starts to increase (i.e., where the cost is still relatively low but the threshold is higher). The resulting confusion matrix and F1 score conveniently update based on the marked threshold.
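The minimum-cost cutoff can be sketched the same way: total up the false positive and false negative costs at each candidate threshold and keep the cheapest. The data and function names below are hypothetical, and the $1 and $10 costs mirror the example above.

```python
def total_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Total misclassification cost when cases with prob >= threshold are flagged as fraud."""
    cost = 0.0
    for y, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp  # false positive: flagged a legitimate transaction
        elif predicted == 0 and y == 1:
            cost += cost_fn  # false negative: missed a fraud
    return cost


def min_cost_threshold(y_true, probs, candidates, cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost (first one wins on ties)."""
    return min(candidates, key=lambda t: total_cost(y_true, probs, t, cost_fp, cost_fn))


# Made-up labels and predicted probabilities for illustration.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]
candidates = [i / 10 for i in range(1, 10)]

cheapest = min_cost_threshold(y_true, probs, candidates, cost_fp=1.0, cost_fn=10.0)
equal_costs = min_cost_threshold(y_true, probs, candidates, cost_fp=1.0, cost_fn=1.0)
```

With a missed fraud costing ten times a false alarm, the cheapest threshold on this toy data drops to 0.3. With equal costs it sits at 0.5, which shows how the cost ratio, not the model alone, drives the cutoff.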
Create and Evaluate an Unsupervised Learning Model
The accelerator also has an unsupervised model based on Principal Component Analysis (PCA) to find anomalous transactions when no target variable is provided in the dataset. PCA is a dimension reduction technique that projects data into a lower-dimensional space while maximizing variance in the transformed data fields. It effectively reduces the number of variables in our data while retaining as much information as possible in the newly transformed data. These new data fields are called principal components or factors. PCA is also used for exploratory data analysis. Most of the information from the original features is compressed into the first components. That’s why we explore how our data and features fall along the first two principal components and visualize them in 2D scatterplots to get an idea of how the new factor space arranges points.
The PCA is run on as many components as possible, but we want to fix an upper limit on the number of components. In other words, we want to decide how much to reduce our data, or how many components to use in our production model. The elbow in the factor plot from the PCA indicates a plateau in the information retained from our original data; here the elbow is at 10 components. In the center visualization, each point is a transaction from the new data. Cases that fall far away from clusters are potential indications of fraud. This is because PCA has compressed regular data into similar regions of the space, while irregular data might fall elsewhere.
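The accelerator reads the elbow from the factor plot by eye. As a numeric companion to that, one common alternative (a different rule, named here plainly as a cumulative-share rule) is to keep the smallest number of components whose cumulative explained variance reaches a chosen share. The Python sketch below uses made-up component variances and hypothetical function names.

```python
def explained_variance_ratio(component_variances):
    """Share of total variance carried by each principal component."""
    total = sum(component_variances)
    return [variance / total for variance in component_variances]


def components_for_share(component_variances, target_share):
    """Smallest number of leading components whose cumulative share reaches target_share."""
    cumulative = 0.0
    ratios = explained_variance_ratio(component_variances)
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target_share:
            return k
    return len(component_variances)


# Made-up variances per component, sorted from largest to smallest.
variances = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]
k = components_for_share(variances, target_share=0.9)
```

Here the first four components carry at least 90% of the variance. The cutoff share is a judgment call, just as the elbow is, so it is worth checking the two against each other.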
To categorize a case as fraud or not, we define oddity as the distance between a point and the origin (since the data was normalized for PCA). Hitting the ‘calculate oddity’ button calculates this metric for each point in the background. In the model results, we can again determine the oddity threshold based on the distribution of the distances we just calculated. The distribution of oddity distances appears to thin out after 35 or so, so this is what we’ll set as the unsupervised model threshold. This threshold is used to differentiate anomalous transactions (above the threshold) from regular ones (below the threshold).
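The oddity metric described above is simply the Euclidean distance of each point from the origin in the principal component space. A minimal Python sketch follows, with made-up component scores and hypothetical function names; the threshold of 35 is the one chosen above.

```python
from math import sqrt


def oddity(component_scores):
    """Euclidean distance from the origin for one point's principal component scores."""
    return sqrt(sum(score * score for score in component_scores))


def flag_anomalies(score_rows, threshold):
    """True for points whose oddity is above the threshold, False otherwise."""
    return [oddity(row) > threshold for row in score_rows]


# Made-up component scores for three transactions (two components each).
score_rows = [
    [3.0, 4.0],
    [30.0, 40.0],
    [0.5, -1.2],
]
flags = flag_anomalies(score_rows, threshold=35.0)
```

Only the second transaction, with an oddity of 50, is flagged as anomalous; the other two sit close to the origin with the bulk of the regular data.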
Both of the models we trained can be deployed to TIBCO Streaming, where they are used to label new, real-time data without any further data labelling or human intervention. The models are deployed to an AMS server and subsequently into a running StreamBase engine. We’ll start a simulation that mimics new transaction data flowing through our system. We can use TIBCO LiveView or our DXP to watch as our models score new data. We can clearly see which transactions are flagged as fraudulent or anomalous when the model scores them above our set thresholds. The DXP tracks flagged cases and queues them for further investigation.
The risk accelerator allows non-experts from any industry to detect risk events in real time using Spotfire and Streaming. We also tested it on heart disease risk data, and our cost function setup was key to designing models given the high stakes of patient safety! The analytics tools provided in this accelerator allow for a customizable data science solution for any risk detection problem.
Elkan, Charles. (2001). The Foundations of Cost-Sensitive Learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence: 4-10 August 2001; Seattle. 1.
Molnar, C. (2021, May 03). Interpretable machine learning. Retrieved May 06, 2021, from https://christophm.github.io/interpretable-ml-book/pdp.html
Partial: Partial Dependence Functions. (n.d.). Retrieved May 06, 2021, from https://www.rdocumentation.org/packages/pdp/versions/0.7.0/topics/partial
Sweta Kotha is a Data Scientist at TIBCO and a recent graduate from Carnegie Mellon University. Her experience spans data science, natural language processing, and biostatistics. She likes trying out new technologies and methods to address analytics challenges and is interested in effectively communicating with data. She enjoys reading, running, and traveling.
David Katz is a Principal Consultant at TIBCO. With a long career in data analysis, model building and statistical consulting, David enjoys tackling challenging problems with real-world benefits, in particular using advanced regression methods and making the invisible visible. The most fun is the variety of applications he has been able to work with, from Formula One racing to marketing and operations. In his spare time he likes to bike, hike and do yoga.