TIBCO Statistica® Classification Trees
Classification trees are frequently used to explore the problem space. They are also known as decision trees and are considered a data mining technique. These trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables.
This module has two split methods:
- Discriminant-based splits, computed using quadratic discriminant analysis as in QUEST (Quick, Unbiased, Efficient Statistical Trees; Loh & Shih, 1997)
- C&RT-style splits, computed with a grid search over all possible combinations of levels of the predictor variables (Breiman et al., 1984)
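To make the exhaustive (grid) search idea concrete, here is a minimal Python sketch for a single ordered predictor. It is not Statistica code: the function names `gini` and `best_threshold_split`, the use of midpoints between distinct values as candidate thresholds, and the toy data are all assumptions chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold_split(values, labels):
    """Evaluate every midpoint between adjacent distinct predictor values
    and return (threshold, impurity_decrease) for the best candidate.
    Returns (None, 0.0) if no candidate reduces the impurity."""
    n = len(labels)
    parent_impurity = gini(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        decrease = parent_impurity - weighted
        if decrease > best[1]:
            best = (threshold, decrease)
    return best


# Hypothetical toy data: two well-separated classes on one predictor.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, decrease = best_threshold_split(values, labels)
print(threshold, decrease)
```

Because every candidate is scored on the learning sample itself, a search of this kind finds the split that looks best on that sample, which is why it can be slower and more prone to overfitting than a discriminant-based split.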
The discriminant-based split is recommended for its reliability and computational speed. C&RT is included for finding the splits that give the best possible classification in the learning sample, though not necessarily in independent cross-validation samples.
When using C&RT, three goodness-of-fit measures are available: the Gini measure, Chi-square, and G-square. The Gini measure was preferred by the developers of C&RT (Breiman et al., 1984). The Chi-square measure is similar to the standard Chi-square value computed for the expected and observed classifications (with priors adjusted for misclassification cost). The G-square measure is similar to the maximum-likelihood Chi-square (as computed, for example, in the Log-Linear module with priors adjusted for misclassification cost).
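As a rough illustration of how the three measures compare, the following Python sketch scores one candidate split from a table of class counts per child node. It uses the textbook formulas (Gini impurity decrease, Pearson Chi-square, and the likelihood-ratio G-square) and assumes equal misclassification costs with no prior adjustment, so it should not be read as the exact computation performed by the module. The function name `split_measures` and the counts are hypothetical.

```python
import math


def split_measures(table):
    """table[i][k] is the number of cases of class k in child node i.
    Returns (gini_decrease, pearson_chi_square, g_square)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    def gini(counts):
        m = sum(counts)
        return 1.0 - sum((c / m) ** 2 for c in counts) if m else 0.0

    # Gini: impurity of the parent minus the size-weighted child impurities.
    gini_decrease = gini(col_totals) - sum(
        (rt / n) * gini(row) for rt, row in zip(row_totals, table)
    )

    chi_square = 0.0
    g_square = 0.0
    for i, row in enumerate(table):
        for k, observed in enumerate(row):
            expected = row_totals[i] * col_totals[k] / n
            if expected > 0:
                chi_square += (observed - expected) ** 2 / expected
                if observed > 0:  # a zero cell contributes nothing to G-square
                    g_square += 2.0 * observed * math.log(observed / expected)
    return gini_decrease, chi_square, g_square


# Hypothetical split: left child holds 30/10 cases, right child 10/30.
table = [[30, 10], [10, 30]]
gini_dec, chi2, g2 = split_measures(table)
print(gini_dec, chi2, g2)
```

All three measures grow as the child nodes become purer, so they often rank candidate splits similarly, but they are on different scales and can disagree on close calls.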
Priors and misclassification costs can be specified as equal, estimated from the data, or user-specified. The user can also specify the v value for v-fold cross-validation during tree building, the v value for v-fold cross-validation for error estimation, the size of the SE rule, the minimum node size before pruning, the seeds for random number generation, and the alpha value for variable selection. Integrated graphics options are provided to explore the input and output data.
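The SE rule setting can be understood through the standard-error rule described by Breiman et al. (1984): among the pruned trees, pick the smallest one whose cross-validation cost does not exceed the minimum cost plus a chosen multiple of its standard error. The Python sketch below assumes that interpretation. The function name `select_tree_by_se_rule`, the tuple layout, and the cost figures are illustrative and are not taken from the product.

```python
def select_tree_by_se_rule(candidates, se_multiplier=1.0):
    """candidates: list of (n_terminal_nodes, cv_cost, cv_cost_se) tuples,
    one per pruned tree. Returns the smallest tree whose CV cost is within
    se_multiplier standard errors of the minimum CV cost."""
    best = min(candidates, key=lambda c: c[1])
    cutoff = best[1] + se_multiplier * best[2]
    eligible = [c for c in candidates if c[1] <= cutoff]
    return min(eligible, key=lambda c: c[0])


# Made-up pruning sequence: (terminal nodes, CV cost, standard error).
candidates = [
    (12, 0.210, 0.020),
    (7, 0.200, 0.020),
    (4, 0.215, 0.021),
    (2, 0.300, 0.025),
]
chosen = select_tree_by_se_rule(candidates)  # 1 SE rule
chosen_zero = select_tree_by_se_rule(candidates, se_multiplier=0.0)  # minimum-cost tree
print(chosen, chosen_zero)
```

With a multiplier of 1 the rule trades a statistically negligible increase in estimated cost for a simpler tree, while a multiplier of 0 simply selects the tree with the lowest cross-validation cost.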
Note: Statistica Expert Data Science and Statistica Enterprise have additional tree classification methods: Boosted Trees, Random Forests, C&RT (General Classification and Regression Tree Models), CHAID (Chi-square Automatic Interaction Detection), and MARSplines (Multivariate Adaptive Regression Splines).