Gradient Boosting Machine Regression - Data Function for TIBCO Spotfire®

Gradient boosting is an ensemble decision-tree machine learning method, delivered here as a data function, that is useful for identifying the variables that best predict an outcome and for building highly accurate predictive models. For example, a retailer might use a gradient boosting algorithm to determine customers' propensity to buy a product based on their buying histories.

Compatible Products

TIBCO Spotfire®

Provider

TIBCO Software

Supported Versions

  • TIBCO Spotfire 6.x
  • TIBCO Spotfire 7.x

License

TIBCO Component Exchange License

Overview

This data function focuses on regression models that have numeric responses. Predictors can be categorical or continuous. The function features automatic handling of nonlinear relationships and variable interactions, high prediction accuracy, and automatic variable selection.

Learn more about using Data Functions

License Details

Review (1)
Rating: 5
Michael OConnell, 3:25pm 09/15/2016

Super to see GBM available as a downloadable Spotfire data function. Very robust method - you can throw any collection of X variables at this, due to its basis in recursive partitioning. Great predictions + rich summary structure including variable importance and interactions. Very useful as an all-around machine learning algorithm in many settings - regression, classification; many industry applications. Having GBM at your fingertips inside Spotfire's interactive viz and highly configurable environment is super powerful - a few clicks give gbm predictions and root-cause variable analysis, right there in the interactive viz environment. Identifying best product offers for customers, equipment failure conditions, risky transactions - built into context like customer profiles and asset metadata. Good stuff!!

Gradient Boosting Machine Regression - Data Function for TIBCO Spotfire®

Purpose: This function provides the ability to use the CRAN gbm package within the Spotfire interface. It is focused on regression. GBM can also perform classification, but that is not addressed in this release.

GBM stands for Gradient Boosting Machines. It is a well-known machine learning technique with a number of advantages:

  • Automatic handling of nonlinear relationships
  • Automatic handling of variable interactions
  • High prediction accuracy
  • Automatic variable selection

The CRAN implementation also includes simple automated handling of missing data (in some cases, preprocessing missing values can improve results).
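
As a quick illustration, gbm accepts a mix of factor and numeric predictors and tolerates missing values in the predictors without preprocessing. A hypothetical sketch in TERR/R (the toy data frame below is invented for illustration):

    library(gbm)

    # Invented toy data: one continuous and one categorical predictor,
    # with a couple of missing values left in place
    set.seed(1)
    x1 <- c(rnorm(98), NA, NA)
    x2 <- factor(sample(c("a", "b", "c"), 100, replace = TRUE))
    y  <- ifelse(is.na(x1), 0, 2 * x1) + as.integer(x2) + rnorm(100, sd = 0.5)
    d  <- data.frame(x1 = x1, x2 = x2, y = y)

    # gbm handles the factor and the NAs without any preprocessing
    fit0 <- gbm(y ~ x1 + x2, data = d, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1)
    summary(fit0, plotit = FALSE)   # relative influence of each predictor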

There are three files:

  • GBM Regression for TIBCO Spotfire Vx.sfd
  • GBM Regression for TIBCO Spotfire Vx.dxp
  • This README file

Installing the gbm package in Spotfire:

The package installs as usual from CRAN. There is one prerequisite: a Java installation with the JAVA_HOME environment variable set.

If you are running locally (without Stat Services), you can use the Spotfire Tools/TERR Tools interface. You can also use the traditional install.packages("gbm") command from the TERR command line. If you are using Stat Services, the latter method is the only way to install gbm: from TERR's command line on the server.
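
With a local TERR engine, the command-line route is short; a minimal sketch, run from the TERR console:

    # Install gbm and its dependencies from CRAN
    install.packages("gbm")
    # Verify the prerequisite: JAVA_HOME should point at a Java installation
    Sys.getenv("JAVA_HOME")
    # Confirm the package loads
    library(gbm)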

Installing the data function into the dxp containing your data: GBM Regression for TIBCO Spotfire Vx.sfd can be imported into a dxp directly using Tools/Register Data Functions or Insert/Data Function/From File. Note that most parameters are optional and will be assigned reasonable default values, which should make it easy to add this function to your own dxps. The provided dxp shows the use of most of the parameters; it can be used for analysis, or as a model for creating your own dxps.

Using the gbmRegression.dxp file with your own data:

The primary function of the .dxp file is to provide an example illustrating how the embedded data function can be wired up to your data in your own .dxp. It is not intended to provide a complete analysis solution, but you can replace the embedded data with your own data using the following procedure:

  1. The input in the dxp is the UserTable. You can start with the provided dxp and simply replace the UserTable with your own data.
  2. Go to the variable selection tab and select Predictors and Response.
  3. Go to the GBM tab and enter Configuration Parameters, then click the Go button.

Model Inputs:

  Name                 Description                                     Type    Required  Data Types
  predictors.df        Model predictor columns                         Table   Yes       Integer, Real, SingleReal, String, Date
  response.df          Response column                                 Column  Yes       Real
  model.name           Enter a name                                    Value   No        String
  n.trees              Number of trees                                 Value   No        Integer
  holdout.sample.size  Number of holdout rows                          Value   No        Integer
  n.minobsinnode       Minimum observations in tree nodes              Value   No        Integer
  bag.fraction         Value between 0 and 1                           Value   No        Real
  model.path           Directory path to store the model               Value   No        String
  learning.rate        Value between 0 and 1                           Value   No        Real
  interaction.depth    1=no interactions; 2=interactions between       Value   Yes       Integer
                       at most 2 variables; 3=interactions between
                       at most 3 variables

Notes

  1. Select one or more continuous or categorical predictor variables.
  2. Select one outcome variable. NOTE: the dxp assumes that a 0/1 outcome should be analyzed as binomial, and that any other numeric response should be analyzed as Gaussian. Other choices can be made available by modifying the code.
  3. Fill in the name of the model you want to create. If you reuse a name, the older version will be overwritten.
  4. Number of Trees to Build: GBM is an ensemble model that builds many tree models in sequence; each one attempts to improve the fit by analyzing the residuals of the trees built so far. Too few trees miss details that could improve the fit to the training data (underfitting); too many trees can overfit, which results in less accuracy when extrapolating to new data (the holdout sample). Typically you want the best results on a holdout sample, and will tune the number of trees accordingly (see the sketch after this list).
  5. Holdout Sample Size: If there is sufficient data, it is good practice to use 20-50% of your data for tuning the model. Specify the number of rows here. If you specify zero, gbm will use the rows left out of each boosting step to estimate the prediction error (the out-of-bag, or OOB, estimate).
  6. Interaction Depth: Specify the maximum depth of variable interactions to search for.
    • 1=no interactions
    • 2=interactions between at most 2 variables
    • 3=interactions between at most 3 variables
  7. Minimum Observations in Tree Nodes: Terminal nodes must contain at least this many observations; candidate splits that would create smaller nodes are rejected.
  8. Bag Fraction: To add variability, each step uses only this proportion of the data. Smaller values build trees faster but generally require more trees. A value between 0 and 1.
  9. Model Path: the directory in which to store the completed model on disk.
  10. Learning Rate: A lower learning rate approaches the optimal fit more slowly and helps prevent overshooting it; a higher rate speeds up fitting. A value between 0 and 1.
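
For reference, these configuration parameters map directly onto arguments of the CRAN gbm() function: learning.rate corresponds to gbm's shrinkage argument, and the holdout sample size corresponds (as a fraction) to the complement of gbm's train.fraction. A hypothetical sketch of the equivalent call, where userTable, the column names, and the parameter values are placeholders rather than the data function's actual internals:

    library(gbm)

    fit <- gbm(
      response ~ .,                     # response vs. all other columns
      data              = userTable,    # hypothetical training data frame
      distribution      = "gaussian",   # the dxp's binomial choice for 0/1
                                        # outcomes corresponds to "bernoulli"
      n.trees           = 3000,         # Number of Trees to Build
      interaction.depth = 2,            # Interaction Depth
      n.minobsinnode    = 10,           # Minimum Observations in Tree Nodes
      bag.fraction      = 0.5,          # Bag Fraction
      shrinkage         = 0.01,         # Learning Rate
      train.fraction    = 0.8           # 1 - (holdout rows / total rows)
    )

    # Best number of trees as judged on the holdout ("test") sample;
    # with no holdout (train.fraction = 1), use method = "OOB" instead
    best.ntrees <- gbm.perf(fit, method = "test")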

 

Model Outputs:

  Name              Description                         Type
  valid.error       Sample error by number of trees     Table
  best.ntrees       Best number of trees (message)      Value
  best.ntrees.int   Best number of trees                Value
  model.quality     RMSE                                Value
  importance.table  Variable importance table           Table
  msg               Started and completed date/times    Value
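
These outputs correspond to standard post-fit summaries from the gbm package. A hypothetical sketch, continuing from the fit object in the earlier sketch; holdout is an invented data frame of held-out rows, and the actual data function may compute these values differently:

    # Validation error by number of trees (valid.error output)
    valid.error <- data.frame(trees = seq_len(fit$n.trees),
                              error = fit$valid.error)

    # Best number of trees (best.ntrees.int output)
    best.ntrees.int <- gbm.perf(fit, method = "test", plot.it = FALSE)

    # Variable importance table (importance.table output)
    importance.table <- summary(fit, n.trees = best.ntrees.int, plotit = FALSE)

    # RMSE on the holdout rows (model.quality output)
    pred <- predict(fit, newdata = holdout, n.trees = best.ntrees.int)
    model.quality <- sqrt(mean((holdout$response - pred)^2))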

[Screenshot of example .dxp page showing many of the model inputs and outputs]

View and Download this Data function on the Exchange

View the Wiki Page