Gradient Boosting Machine Regression - Data Function for TIBCO Spotfire®
Gradient boosting is an ensemble decision-tree machine learning method, provided here as a data function, that is useful for identifying the variables that best predict an outcome and for building highly accurate predictive models. For example, a retailer might use a gradient boosting algorithm to determine customers' propensity to buy a product based on their buying histories.
Compatible Products
TIBCO Spotfire®
Provider
TIBCO Software
Supported Versions
- TIBCO Spotfire 6.x
- TIBCO Spotfire 7.x
License
TIBCO Component Exchange License
Overview
This data function focuses on regression models with numeric responses. Predictors can be categorical or continuous. The function features automatic handling of nonlinear relationships and variable interactions, high prediction accuracy, and automatic variable selection.
- Download Software
Learn more about using Data Functions
License Details
Release(s)
Release v1.2
Published: June 2016
Initial release
Super to see GBM available as a downloadable Spotfire data function. Very robust method - you can throw any collection of X variables at it, thanks to its basis in recursive partitioning. Great predictions plus a rich summary structure, including variable importance and interactions. Very useful as an all-around machine learning algorithm in many settings - regression, classification - with many industry applications. Having GBM at your fingertips inside Spotfire's interactive, highly configurable environment is super powerful - a few clicks give you gbm predictions and root-cause variable analysis, right there in the interactive viz environment. Identifying the best product offers for customers, equipment failure conditions, risky transactions - built into context like customer profiles and asset metadata. Good stuff!
Gradient Boosting Machine Regression - Data Function for TIBCO Spotfire®
Purpose: This function provides the ability to use the CRAN gbm package within the Spotfire interface. It focuses on regression. GBM also supports classification, but that is not addressed in this release.
GBM stands for Gradient Boosting Machines. It's a well-known machine learning technique with a number of advantages.
- Automatic handling of nonlinear relationships
- Automatic handling of variable interactions
- High Prediction Accuracy
- Automatic Variable Selection
The CRAN implementation also handles missing data automatically in a simple way. (In some cases, preprocessing missing values can improve results.)
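For readers who do want to preprocess missing values themselves, here is a minimal sketch of one common approach, median imputation. It uses Python/scikit-learn's SimpleImputer purely as a stand-in for illustration; the actual data function relies on gbm's built-in NA handling in TERR:

```python
# Hedged sketch: pre-imputing missing predictor values with the column median.
# This is an illustrative Python analogue, not part of the TERR data function.
import numpy as np
from sklearn.impute import SimpleImputer

# Toy predictor matrix with missing entries (NaN)
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])

# Replace each NaN with the median of its column
X_filled = SimpleImputer(strategy="median").fit_transform(X)
# X_filled now contains no missing values
```

Whether imputation helps depends on the data; gbm's surrogate handling of NAs is often adequate, so treat this as an optional experiment rather than a required step.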
There are 3 files:
- GBM Regression for TIBCO Spotfire Vx.sfd
- GBM Regression for TIBCO Spotfire Vx.dxp
- This README file
Installing the gbm package in Spotfire:
The package installs as usual from CRAN. There is one prerequisite: a Java installation with JAVA_HOME set.
If you are running locally (without Stat Services), you can use the Spotfire Tools/TERR Tools interface, or the traditional install.packages("gbm") from the TERR command line. If you are using Stat Services, the only way to install gbm is from TERR's command line on the server.
Installing the data function into the dxp containing your data: GBM Regression for TIBCO Spotfire Vx.sfd can be imported into a dxp directly using Tools/Register Data Functions or Insert/Data Function/From File. Note that most parameters are optional and will be assigned reasonable default values, which should make it easy to add this function to your own dxps. The provided dxp shows the use of most of the parameters; it can be used for analysis, or as a model for creating your own dxps.
Using the gbmRegression.dxp file with your own data:
The primary purpose of the .dxp file is to illustrate how the embedded data function can be wired up to your data in your own .dxp. It is not intended as a complete analysis solution, but you can replace the embedded data with your own using the following procedure:
- The input in the dxp is the UserTable. You can start with the provided dxp and simply replace the UserTable with your own data.
- Go to the variable selection tab and select Predictors and Response.
- Go to the GBM tab, enter the Configuration Parameters, then click the Go button.
Model Inputs:
| Name | Description | Type | Required | Data Types |
| --- | --- | --- | --- | --- |
| predictors.df | Model predictor columns | Table | Yes | Integer, Real, SingleReal, String, Date |
| response.df | Response column | Column | Yes | Real |
| model.name | Enter a name | Value | No | String |
| n.trees | Number of trees | Value | No | Integer |
| holdout.sample.size | Number of holdout rows | Value | No | Integer |
| n.minobsinnode | Minimum observations in tree nodes | Value | No | Integer |
| bag.fraction | Value between 0 and 1 | Value | No | Real |
| model.path | Directory path to store the model | Value | No | String |
| learning.rate | Value between 0 and 1 | Value | No | Real |
| interaction.depth | 1 = no interactions; 2 = interactions between at most 2 variables; 3 = interactions between at most 3 variables | Value | Yes | Integer |
Notes
- Select one or more continuous or categorical predictor variables.
- Select one outcome variable. NOTE: the dxp assumes that a 0/1 outcome should be analyzed as binomial, and that other numeric responses should be analyzed as Gaussian. Other choices can be made available by modifying the code.
- Fill in the name of the model you want to create. If you reuse a name, the older version will be overwritten.
- Number of Trees to Build: GBM is an ensemble model that builds many tree models in sequence; each one attempts to improve the fit by analyzing the residuals of the trees built so far. Too few trees leave out detail that could improve the fit to the training data (underfitting). Too many trees can overfit, which reduces accuracy when predicting on new data (the holdout sample). Typically you want the best results on a holdout sample, and will tune the number of trees accordingly.
- Holdout Sample Size: If there is sufficient data, it is good practice to reserve 20-50% of your data for tuning the model. Specify the number of rows here. If you specify zero, gbm uses the unused data from each step to estimate the prediction error (the out-of-bag, or OOB, estimate).
- Interaction Depth: Specify the number of interactions to search for.
- 1=no interactions
- 2=interactions between at most 2 variables
- 3=interactions between at most 3 variables
- Minimum Observations in Tree Nodes: Terminal nodes must contain at least this many observations; otherwise they are excluded from the model.
- Bag Fraction: To add variability, each step uses only this proportion of the data. Smaller values build trees faster but generally require more trees. Value between 0 and 1.
- Model Path: Where to store the completed model on disk.
- Learning Rate: A lower learning rate avoids overshooting the optimum by approaching it more slowly; a higher rate speeds up training. Value between 0 and 1.
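To illustrate how these parameters interact, here is a minimal sketch using scikit-learn's GradientBoostingRegressor, which exposes close analogues of the gbm parameters above. This is Python/scikit-learn, not the TERR/gbm code shipped with the data function, and the dataset and parameter values are made up for illustration:

```python
# Hedged sketch: a scikit-learn analogue of the gbm regression workflow.
# Parameter mapping (gbm -> scikit-learn) is noted inline.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the UserTable
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

# holdout.sample.size: reserve rows for tuning the number of trees
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=300, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,       # n.trees
    learning_rate=0.05,     # learning.rate
    max_depth=2,            # interaction.depth (at most 2-variable interactions)
    subsample=0.5,          # bag.fraction
    min_samples_leaf=10,    # n.minobsinnode
    random_state=0,
)
model.fit(X_train, y_train)

# Holdout error after each tree (analogous to the valid.error output)
hold_err = [mean_squared_error(y_hold, p) for p in model.staged_predict(X_hold)]
best_ntrees = int(np.argmin(hold_err)) + 1          # analogous to best.ntrees
rmse = float(np.sqrt(min(hold_err)))                # analogous to model.quality
importance = model.feature_importances_             # analogous to importance.table
```

Here staged_predict plays the role of the valid.error table: it gives the holdout error after each tree, from which the best number of trees and the holdout RMSE follow, mirroring the outputs listed below.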
Model Outputs
| Name | Description | Type |
| --- | --- | --- |
| valid.error | Sample error by number of trees | Table |
| best.ntrees | Best number of trees (message) | Value |
| best.ntrees.int | Best number of trees | Value |
| model.quality | RMSE | Value |
| importance.table | Variable importance table | Table |
| msg | Started and completed date/times | Value |
[Screenshot: example .dxp page showing many of the model inputs and outputs]
View and Download this Data function on the Exchange