TIBCO Statistica® Generalized Linear/Nonlinear Models
Generalized Linear/Nonlinear Models (GLZ) is an implementation of the generalized linear model. It can analyze both linear and nonlinear effects of any number and type of predictor variables on a discrete or continuous dependent variable. Designs can include multiple-degrees-of-freedom effects for categorical predictor variables, single-degree-of-freedom effects for continuous predictor variables, or any combination of effects for continuous and categorical predictor variables. GLZ also implements stepwise and best-subset model-building techniques for any type of design. GLZ uses the maximum likelihood (ML) methods of the generalized linear model to build models and to estimate and test hypotheses about effects in the model.
In its simplest form, a linear model specifies the (linear) relationship between a dependent (or response) variable Y, and a set of predictor variables, the X's, so that
Y = b0 + b1X1 + b2X2 + ... + bkXk
In this equation, b0 is the regression coefficient for the intercept, and b1 through bk are the regression coefficients (for variables 1 through k) computed from the data.
For example, you could estimate (i.e., predict) a person's weight as a function of the person's height and gender. You could use linear regression to estimate the respective regression coefficients from a sample of data in which height, weight, and gender were recorded for each subject. For many data analysis problems, estimates of the linear relationships between variables are adequate to describe the observed data, and to make reasonable predictions for new observations.
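As a minimal sketch of the weight example, the coefficients b0, b1, and b2 can be estimated by ordinary least squares. The data values, the 0/1 coding of gender, and the helper names `solve` and `fit_linear` are assumptions made for illustration only; they are not part of GLZ.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(rows, y):
    """Least-squares fit of Y = b0 + b1*X1 + ... + bk*Xk via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical, noise-free sample: height in cm, gender coded 0/1, weight in kg.
heights = [150, 160, 170, 180, 190, 155, 165, 175]
genders = [0, 1, 0, 1, 0, 1, 1, 0]
weights = [-50.0 + 0.7 * h + 5.0 * g for h, g in zip(heights, genders)]

b0, b1, b2 = fit_linear(list(zip(heights, genders)), weights)

# Predicted weight for a new observation (height 172 cm, gender coded 1).
predicted = b0 + b1 * 172 + b2 * 1
print(b0, b1, b2, predicted)
```

Because the made-up weights were generated without noise, the fit recovers the generating coefficients up to rounding error; with real data the estimates would only approximate the underlying relationship.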
However, there are relationships that cannot adequately be summarized by a simple linear equation for two major reasons:
- Distribution of dependent variable. The dependent variable of interest may have a non-continuous distribution. Therefore the predicted values should also follow the respective distribution; any other predicted values are not logically possible.
For example, a researcher may be interested in predicting one of three possible discrete outcomes (e.g., a consumer's choice of one of three alternative products). In that case, the dependent variable can only take on 3 distinct values, and the distribution of the dependent variable is said to be multinomial.
Or suppose you are trying to predict people's family planning choices, specifically, how many children families will have, as a function of income and various other socioeconomic indicators. The dependent variable -- number of children -- is discrete (i.e., a family may have 1, 2, or 3 children and so on, but cannot have 2.4 children), and most likely the distribution of that variable is highly skewed (i.e., most families have 1, 2, or 3 children, fewer will have 4 or 5, very few will have 6 or 7, and so on). In this case it would be reasonable to assume that the dependent variable follows a Poisson distribution.
- Link function. A second reason why the linear (multiple regression) model might be inadequate to describe a particular relationship is that the effect of the predictors on the dependent variable may not be linear in nature.
For example, the relationship between a person's age and various indicators of health is most likely not linear. The (average) health status of people who are 30 years old is not markedly different from the (average) health status of people who are 40 years old. However, the difference in health status between 60-year-old and 70-year-old people is probably greater. Some kind of power function would probably describe the relationship adequately, so that each additional year of age at older ages has a greater impact on health status than each additional year during early adulthood. In other words, the link between age and health status is best described as nonlinear, or as a power relationship in this particular example.
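The two points above can be sketched together for the family-size example: a Poisson-distributed count response combined with a link function. The sketch below assumes a log link (a common choice for Poisson responses, not one prescribed by the text) and estimates the coefficients by maximum likelihood using iteratively reweighted least squares, the fitting method described by McCullagh and Nelder (1989). The data and the function names are hypothetical, and this is not a description of how GLZ is implemented internally.

```python
import math


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_poisson(rows, y, iterations=25, tol=1e-10):
    """ML fit of log(E[Y]) = b0 + b1*X1 + ... by iteratively reweighted least squares."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    # Start from the intercept-only model: log of the mean count.
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(iterations):
        eta = [sum(xj * bj for xj, bj in zip(x, beta)) for x in X]  # linear predictor
        mu = [math.exp(e) for e in eta]  # inverse log link: always positive
        # Working response; for Poisson with log link the weights equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(m * x[i] * x[j] for x, m in zip(X, mu)) for j in range(p)]
                for i in range(p)]
        xtwz = [sum(m * x[i] * zi for x, m, zi in zip(X, mu, z)) for i in range(p)]
        new_beta = solve(xtwx, xtwz)
        step = max(abs(nb - ob) for nb, ob in zip(new_beta, beta))
        beta = new_beta
        if step < tol:
            break
    return beta


# Hypothetical data: income (arbitrary units) and number of children per family.
income = [1, 2, 3, 4, 5, 6, 7, 8]
children = [4, 3, 3, 2, 2, 1, 1, 0]

intercept, slope = fit_poisson([[x] for x in income], children)
fitted = [math.exp(intercept + slope * x) for x in income]
print(intercept, slope, fitted)
```

Unlike the predictions of a plain linear model, the fitted means here can never be negative, which is the sense in which the predicted values respect the assumed distribution of the dependent variable.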
Different methods for automatic model building are available. Specifically, forward entry, backward removal, forward stepwise, and backward stepwise procedures can be performed, as well as best-subset search procedures. In forward methods of selection of effects to include in the model (i.e., forward entry and forward stepwise methods), score statistics are compared to select new (significant) effects. The Wald statistic can be used for backward removal methods (i.e., backward removal and backward stepwise, when effects are selected for removal from the model).
The best-subset search method can be based on three different test statistics: the score statistic, the model likelihood, and the AIC (Akaike Information Criterion). Note that, since the score statistic does not require iterative computations, best-subset selection based on the score statistic is computationally fastest, while selection based on the other two statistics usually provides more accurate results.
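To illustrate the idea of a best-subset search, the following sketch fits every subset of three candidate predictors and ranks the subsets by AIC. For brevity it assumes a normal response with an identity link, so each model can be fit by least squares and its AIC computed as n*log(RSS/n) + 2k, up to an additive constant that is the same for every subset. The data, the predictor names, and the helper functions are hypothetical; the score-statistic and likelihood criteria are not shown.

```python
import math
from itertools import combinations


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(xj * bj for xj, bj in zip(x, beta))) ** 2
               for x, yi in zip(X, y))


def aic(columns, y):
    """Gaussian AIC up to a constant: n*log(RSS/n) + 2k."""
    n = len(y)
    k = len(columns) + 2  # coefficients incl. intercept, plus the error variance
    return n * math.log(rss(columns, y) / n) + 2 * k


# Hypothetical two-level design: y depends on x1 and x2 but not on x3.
predictors = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
noise = [-0.5, 0.5, 0.5, -0.5, 0.5, -0.5, -0.5, 0.5]
y = [3 + 2 * a - 1.5 * b + e
     for a, b, e in zip(predictors["x1"], predictors["x2"], noise)]

# Exhaustive search over all subsets, including the intercept-only model.
names = list(predictors)
scores = {}
for size in range(len(names) + 1):
    for subset in combinations(names, size):
        scores[subset] = aic([predictors[name] for name in subset], y)

best = min(scores, key=scores.get)
print(best, scores[best])
```

An exhaustive search fits 2^p models for p candidate effects, so it grows quickly with the number of predictors. When every fit is itself iterative, as with maximum likelihood, that cost is multiplied, which is why a criterion that avoids iterative computation is faster.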
For additional information about the generalized linear model, see Dobson (1990), Green and Silverman (1994), or McCullagh and Nelder (1989).
For additional information about AIC, see Akaike (1973).
For additional information on the test statistics used by the best-subset search method, see McCullagh and Nelder (1989).