Data Function for TIBCO® Data Science - Team Studio in TIBCO Spotfire®
This data function enables users to execute a TIBCO® Data Science - Team Studio workflow from Spotfire.
The following requirements must be met to enable running Team Studio workflows from the data function extension:
1. Spotfire 7.13 or 10.x client and server
2. Latest copy of TeamStudioCore*.spk and TeamStudioForms*.spk
3. Data Science Team Studio 6.4 or 6.5 Instance
4. Data source set up in Data Science Team Studio 6.4 or 6.5 instance. (Hive is used in this example)
TIBCO Component Exchange License
Data Function for TIBCO® Data Science - Team Studio in TIBCO Spotfire® enables users to execute a TIBCO® Data Science - Team Studio workflow from Spotfire. Users can utilize document properties and Team Studio data functions to execute workflows and bring back the results to update the Spotfire visualizations dashboard.
A demo is included that provides an example showing how this data function can be used. This demo uses a credit-scoring workflow to solve a business problem in the insurance industry.
For more information on TIBCO® Data Science - Team Studio, view this Community Wiki page
The Data Science Team did such a great job with this! This is our first step toward deeper integration with other TIBCO products, and it satisfies a common customer need.
Very exciting to see big data, advanced analytics capabilities tightly integrated with Spotfire like this!
Check out this video for more information
The following are the requirements to enable the Data Function in Spotfire:
- Spotfire 7.13 (or later) client and server
- Latest copy of TeamStudioCore*.spk and TeamStudioForms*.spk
- Data Science Team Studio 6.4 instance
- Data Source (Hive is used in this example) set up in the Data Science Team Studio 6.4 instance
The custom “Data Function for TIBCO® Data Science - Team Studio in TIBCO Spotfire®” is available from the TIBCO Exchange here.
To add the custom data function to the client software, deploy the above packages to a Spotfire server, then log in from the client to the deployment area that contains the .spk files so that the client is updated. See here for details.
Credit Scoring Use Case
The dataset contains 20 columns that describe various socio-economic metrics for multiple customers. Each row is an application by a customer to a financial institution requesting credit. These metrics can help us identify creditworthy customers.
Disclaimer: The data used in this demo is likely fictitious and has been created for the purpose of the demo.
Connecting Team Studio to Data Sources
In this example, we use Hive as the data source to store our raw data and the outputs from the operators in the workflow. See here for details.
Team Studio Workflow Walkthrough
The credit scoring workflow in Team Studio is built from multiple operators, each with its own functionality, as described below.
Alpine Forest Classification operator
An Alpine Forest Classification model is an ensemble classification method of creating a collection of decision trees with controlled variation. Ensemble modeling is the application of many models, each operating on a subset of the data. It is also sometimes referred to as model-averaging or "bagging".
Alpine Forest Classification modeling is considered to be one of the most accurate learning algorithms currently available, producing highly accurate categorical classification results. More information is available here
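The bagging idea described above can be illustrated with a toy sketch. This is not the Alpine Forest implementation: the one-split "stump" model, the function names, and the tiny dataset are all assumptions made for illustration. Each model is trained on a bootstrap sample (rows drawn with replacement), and the ensemble predicts by majority vote.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(bag):
    """Fit a one-split toy 'tree' on (value, label) pairs, splitting at the bag's mean value."""
    threshold = sum(v for v, _ in bag) / len(bag)
    overall = majority([y for _, y in bag], None)
    left = majority([y for v, y in bag if v <= threshold], overall)
    right = majority([y for v, y in bag if v > threshold], overall)
    return lambda v: left if v <= threshold else right

def train_bagged_ensemble(rows, n_models, seed=0):
    """Train n_models stumps, each on its own bootstrap sample of rows."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bag = [rng.choice(rows) for _ in rows]  # sample with replacement
        models.append(train_stump(bag))
    return models

def predict(models, value):
    """Ensemble prediction: majority vote over the individual models."""
    return majority([m(value) for m in models], None)

# Hypothetical data: (numeric feature, credit rating)
rows = [(1, "bad"), (2, "bad"), (3, "bad"), (8, "good"), (9, "good"), (10, "good")]
models = train_bagged_ensemble(rows, n_models=25)
print(predict(models, 1), predict(models, 10))
```

Individual stumps can be wrong when their bootstrap sample happens to be unrepresentative; the vote across many models is what stabilizes the result.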
Here, we use Alpine Forest to perform feature selection. Tree-based methods such as Alpine Forest naturally rank features by how well they improve the purity of a node. The ranking measure is the mean decrease in impurity over all trees, where impurity is measured as Gini impurity. Splits with the greatest decrease in impurity occur near the top of the trees, while splits with the least decrease occur near the bottom. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
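As a rough illustration of the impurity measure, the sketch below computes Gini impurity, the decrease produced by a single split, and the mean decrease per feature across trees. The function names, feature names, and numbers are hypothetical, and the way Alpine Forest computes its importances internally is not documented here, so treat this as a generic sketch of the technique.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

def mean_decrease_in_impurity(splits_per_tree):
    """Average each feature's total impurity decrease over all trees.

    splits_per_tree holds one list per tree of (feature, decrease) pairs.
    """
    totals = Counter()
    for splits in splits_per_tree:
        for feature, decrease in splits:
            totals[feature] += decrease
    n_trees = len(splits_per_tree)
    return {feature: total / n_trees for feature, total in totals.items()}

parent = ["good"] * 4 + ["bad"] * 4
pure_split = impurity_decrease(parent, parent[:4], parent[4:])   # separates the classes
useless_split = impurity_decrease(parent, parent[2:6], parent[:2] + parent[6:])

# Hypothetical per-tree split records for two trees
importances = mean_decrease_in_impurity([
    [("credit_history", 0.5)],
    [("credit_history", 0.3), ("age", 0.1)],
])
```

A split that separates the classes perfectly removes all impurity from the node, while a split that leaves both children as mixed as the parent contributes nothing to the feature's importance.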
Random Sampling operator
The Random Sampling operator extracts data rows from the input dataset and generates sample tables/views according to the sample properties (percentage or row count) specified by the user. More information is available here
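The two sample properties mentioned above (percentage or row count) can be sketched as follows. The function name and parameters are hypothetical, and sampling without replacement is an assumption; the operator's actual behavior is defined by Team Studio.

```python
import random

def sample_rows(rows, percentage=None, row_count=None, seed=None):
    """Return a random sample of rows, sized by percentage or by row count."""
    if (percentage is None) == (row_count is None):
        raise ValueError("specify exactly one of percentage or row_count")
    if percentage is not None:
        if not 0 <= percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")
        row_count = round(len(rows) * percentage / 100)
    row_count = min(row_count, len(rows))  # cannot sample more rows than exist
    return random.Random(seed).sample(rows, row_count)

applications = list(range(100))          # stand-in for 100 credit applications
by_percentage = sample_rows(applications, percentage=20, seed=1)
by_count = sample_rows(applications, row_count=5, seed=1)
```

Passing a seed makes the sample reproducible, which is useful when comparing model runs on the same subset.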
Predictor operator
The Predictor operator follows most model operators and predicts a value as a result. It can be used with any of the model operators. More information is available here
The Predictor Operator applies the input regression, classification, or clustering model to the input dataset in order to predict a value (or the highest probability value):
- The input dataset must contain columns with the same names as the columns in the dataset used for model training, with the exception of the dependent column.
- The operator writes its prediction columns, together with the columns of the input dataset, to a prediction table specified by the user.
- The operator includes the following prediction columns in the output table:
  - PRED_<model_abbreviation> – the predicted value, or the value with the highest probability
  - CONF_<model_abbreviation> – the confidence in the predicted value
  - INFO_<model_abbreviation> – a dictionary of information about the results
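To make the output layout concrete, the sketch below appends the three prediction columns to one input row. The abbreviation "AF", the input columns, and the contents of the INFO dictionary (here, the per-class probabilities) are assumptions for illustration; the real operator decides these values.

```python
def add_prediction_columns(row, class_probabilities, model_abbreviation):
    """Return a copy of `row` extended with PRED_, CONF_, and INFO_ columns."""
    predicted = max(class_probabilities, key=class_probabilities.get)
    output = dict(row)  # keep the input dataset's columns
    output[f"PRED_{model_abbreviation}"] = predicted
    output[f"CONF_{model_abbreviation}"] = class_probabilities[predicted]
    output[f"INFO_{model_abbreviation}"] = dict(class_probabilities)
    return output

# Hypothetical input row and model output
row = {"age": 35, "credit_amount": 2500}
scored = add_prediction_columns(row, {"good": 0.8, "bad": 0.2}, "AF")
```

The input row is copied rather than modified, mirroring the way the operator writes to a separate prediction table.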
Confusion Matrix operator
The Confusion Matrix operator is a classification model evaluation operator. It presents its results graphically, displaying actual versus predicted counts for a classification model, and helps assess the model's accuracy for each possible class value. More information is available here
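The counts behind such a matrix can be sketched in a few lines. The function names and the sample labels are hypothetical; the sketch only shows how actual versus predicted counts and per-class accuracy are derived.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Nested dict: matrix[actual_class][predicted_class] -> count."""
    counts = Counter(zip(actual, predicted))
    classes = sorted(set(actual) | set(predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}

def per_class_accuracy(matrix):
    """Fraction of each actual class that was predicted correctly."""
    result = {}
    for actual_class, row in matrix.items():
        total = sum(row.values())
        result[actual_class] = row[actual_class] / total if total else 0.0
    return result

actual = ["good", "good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "good", "bad", "good", "good"]
matrix = confusion_matrix(actual, predicted)
accuracy = per_class_accuracy(matrix)
```

Reading along a row shows how the members of one actual class were distributed across the predicted classes.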
Alpine Forest Evaluator operator
The Alpine Forest Evaluator Operator is an Alpine Forest model evaluation operator. It is more graphic in nature. It provides model accuracy data, a confusion matrix heat map that illustrates the classification model's accuracy for each possible predicted value, and an error convergence rate graph.
Alpine Forest Evaluator Operator works for Alpine Forest Classification flows only. More information is available here
Load to Hive operator
The “Load to Hive” operator provides a mechanism for saving a table directly to a Hive database. You may use any tabular Hadoop file as the input, as long as the data source is configured for the chosen Hive Hadoop source. More information is available here
Export to SBDF (HD)
The "Export to SBDF" operator converts an HDFS tabular data set to the Spotfire binary data frame (SBDF) format. The SBDF files are stored in the same workspace as the workflow and can be downloaded for use in TIBCO Spotfire. More information is available here
Connecting Spotfire to Team Studio Data Sources
The custom data function allows users to execute the Team Studio credit scoring workflow from Spotfire. The raw data and the results of the executed workflow are stored in the Hive data source in the Team Studio instance. Spotfire then uses the “Cloudera Impala” connector to connect to the raw data and results in the Hive data source and updates the visualizations with the new results. Here, the results are aggregations computed by the Team Studio workflow operators.
The first page of the DXP file contains information about the use case and the tools, technologies, and techniques used to solve the problem.
The ‘Data’ page contains information about the columns in the credit scoring dataset and a table visualization showing the raw data.
The ‘Model Evaluation’ page contains four sections: Input Parameters (top left), Variable Importance (top right), Actuals vs Predicted (bottom left), and Probability Distribution (bottom right).
The top-left section lets users enter input parameters to tune the Alpine Forest classifier: “Number of Trees”, “Min Size for Split”, and “Min Leaf Size”. These values are then passed to the Team Studio workflow for execution.
The bar chart lets users compare the actual credit ratings with the predictions from the Alpine Forest model. The credit ratings displayed here are either bad or good, broken down by gender.
The Variable Importance chart helps users identify the most important features, or variables, for predicting the credit rating of customers. The chart is built from a table returned by the Team Studio workflow as one of the outputs of the Team Studio data function. The goal is to create a model that includes only the most important features, a process called “feature selection”. Here, we use Alpine Forest to perform feature selection. This has three benefits. First, the model becomes simpler to interpret. Second, we reduce the variance of the model, and therefore overfitting. Finally, we reduce the computational cost (and time) of training the model.