A categorical variable here is one that is binary, ordinal, or nominal. Event count data are discrete (categorical) but are often treated as continuous. When the dependent variable is categorical, the ordinary least squares (OLS) method can no longer produce the best linear unbiased estimator (BLUE); that is, OLS is biased and inefficient. Consequently, researchers have developed various categorical dependent variable models (CDVMs). The nonlinearity of CDVMs makes their output difficult to interpret, since the effect of a change in a variable depends on the values of all other variables in the model (Long 1997).
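To illustrate why the effect of one variable depends on the values of the others, the following minimal sketch computes the marginal effect of x1 in a binary logit model at several values of x2. The coefficients (b0, b1, b2) and the evaluation points are hypothetical numbers chosen for illustration only; they are not estimates from any data set discussed here.

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF: P(y = 1) for a given linear index z = x'b."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_marginal_effect(beta_k, xb):
    """Marginal effect dP(y=1)/dx_k = beta_k * p * (1 - p) in a binary logit model."""
    p = logistic_cdf(xb)
    return beta_k * p * (1.0 - p)

# Hypothetical coefficients (assumed for illustration, not estimated)
b0, b1, b2 = -1.0, 0.5, 1.2

effects = {}
for x2 in (0.0, 1.0, 3.0):
    xb = b0 + b1 * 1.0 + b2 * x2          # x1 is held at 1.0 throughout
    effects[x2] = logit_marginal_effect(b1, xb)
    print(f"x2 = {x2:.1f}: P(y=1) = {logistic_cdf(xb):.3f}, dP/dx1 = {effects[x2]:.4f}")
```

The coefficient on x1 is the same in every row, yet its effect on the probability shrinks as x2 pushes the predicted probability toward one. In a linear model the corresponding effect would simply be the constant b1.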
1.1 Categorical Dependent Variable Models
In CDVMs, the left-hand side (LHS) variable or dependent variable is neither interval nor ratio, but rather categorical. The level of measurement and data generation process (DGP) of a dependent variable determine the proper type of CDVM. Thus, binary responses are modeled with the binary logit and probit regressions, ordinal responses are modeled with the ordered logit/probit regressions, and nominal responses are analyzed by multinomial logit, conditional logit, or nested logit models. Independent variables on the right-hand side (RHS) may be interval, ratio, or binary (dummy).
CDVMs adopt the maximum likelihood (ML) estimation method, whereas OLS uses the moment-based method. The ML method requires assumptions about probability distribution functions, such as the logistic function and the complementary log-log function. Logit models use the standard logistic probability distribution, while probit models assume the standard normal distribution. This document focuses on logit and probit models only. Table 1 summarizes CDVMs in comparison with OLS.
Table 1. Ordinary Least Squares and CDVMs
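| | Model | Dependent (LHS) | Estimation | Independent (RHS) |
| --- | --- | --- | --- | --- |
| OLS | Ordinary least squares | Interval or ratio scale | Moment-based method | A linear function of interval/ratio or binary independent variables |
| CDVMs | Binary response | Binary (0 or 1) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| | Ordinal response | Ordinal (1st, 2nd, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| | Nominal response | Nominal (A, B, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| | Event count data | Count (0, 1, 2, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |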
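The maximum likelihood estimation that CDVMs rely on can be sketched for the simplest case, a binary logit model with an intercept and one regressor. The code below is a minimal illustration, not the algorithm used by any particular package: it maximizes the logit log-likelihood by Newton-Raphson, and the ten observations in x and y are made-up values assumed purely for demonstration.

```python
import math

def log_likelihood(b0, b1, x, y):
    """Binary logit log-likelihood: sum of y*ln(p) + (1-y)*ln(1-p)."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

def fit_logit(x, y, iterations=25):
    """Maximize the logit log-likelihood for P(y=1) = Lambda(b0 + b1*x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # gradient (score) of the log-likelihood
        h00 = h01 = h11 = 0.0      # negative Hessian (information matrix)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: b <- b + (negative Hessian)^(-1) * gradient, using the 2x2 inverse
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data (assumed for illustration); the outcomes overlap, so the MLE exists
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logit(x, y)
print(f"b0 = {b0_hat:.4f}, b1 = {b1_hat:.4f}, "
      f"log-likelihood = {log_likelihood(b0_hat, b1_hat, x, y):.4f}")
```

Replacing the logistic CDF with the standard normal CDF in the likelihood would give the corresponding probit estimator. A fixed iteration count is used here for brevity; statistical packages instead stop when a convergence criterion is met.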
1.2 Logit Models versus Probit Models
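How do logit models differ from probit models? The core difference lies in the distribution of errors. In the logit model, errors are assumed to follow the standard logistic distribution, $\lambda(\varepsilon) = \dfrac{e^{\varepsilon}}{(1+e^{\varepsilon})^{2}}$. The errors of the probit model are assumed to follow the standard normal distribution, $\phi(\varepsilon) = \dfrac{1}{\sqrt{2\pi}}\,e^{-\varepsilon^{2}/2}$.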
Figure 1. Comparison of the Standard Normal and Standard Logistic Probability Distributions
[Figure panels: PDF and CDF of the standard normal distribution; PDF and CDF of the standard logistic distribution]
The probability density function (PDF) of the standard normal probability distribution has a higher peak and thinner tails than the standard logistic probability distribution (Figure 1). The standard logistic distribution looks as if someone has weighed down the peak of the standard normal distribution and stretched its tails. As a result, the cumulative distribution function (CDF) of the standard normal distribution is steeper in the middle than the CDF of the standard logistic distribution and approaches zero on the left and one on the right more quickly.
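The two models, of course, produce different parameter estimates. In binary response models, the estimates of a logit model are roughly $\pi/\sqrt{3}$ (about 1.8) times larger than those of the corresponding probit model, because the standard logistic distribution has a standard deviation of $\pi/\sqrt{3}$ whereas the standard normal distribution has a standard deviation of 1. These estimators, however, are almost the same in terms of the standardized impacts of independent variables and predictions (Long 1997).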
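The comparison of the two distributions can be checked numerically. The sketch below uses only the Python standard library to compare peaks and tails of the two PDFs, and then measures how far apart the two CDFs are once the logistic index is rescaled by pi/sqrt(3). The grid of evaluation points and the function names are choices made for this illustration.

```python
import math

def normal_pdf(z):
    """PDF of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_pdf(z):
    """PDF of the standard logistic distribution, written as F(z) * (1 - F(z))."""
    p = logistic_cdf(z)
    return p * (1.0 - p)

# Standard deviation of the standard logistic distribution
scale = math.pi / math.sqrt(3.0)

print(f"peak (z=0): normal = {normal_pdf(0.0):.4f}, logistic = {logistic_pdf(0.0):.4f}")
print(f"tail (z=4): normal = {normal_pdf(4.0):.6f}, logistic = {logistic_pdf(4.0):.6f}")

# After rescaling the logistic index by pi/sqrt(3), both distributions have unit variance
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(logistic_cdf(scale * z) - normal_cdf(z)) for z in grid)
print(f"scale factor = {scale:.4f}, largest CDF gap after rescaling = {max_gap:.4f}")
```

The small gap between the rescaled CDFs is one way to see why logit and probit models give nearly identical predicted probabilities even though their raw coefficients differ.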
In general, logit models reach convergence in estimation fairly well. Some (multinomial) probit models may take a long time to reach convergence, although the probit works well for bivariate models.
1.3 Estimation in SAS, STATA, LIMDEP, and SPSS
SAS provides several procedures for CDVMs, such as LOGISTIC, PROBIT, GENMOD, QLIM, MDC, and CATMOD. Since these procedures support various models, a CDVM can be estimated by multiple procedures. For example, you may run a binary logit model using the LOGISTIC, PROBIT, GENMOD, or QLIM procedure. The LOGISTIC and PROBIT procedures of SAS/STAT have been commonly used, but the QLIM and MDC procedures of SAS/ETS are noted for their advanced features.
Table 2. Procedures and Commands for CDVMs
| | Model | SAS 9.1 | STATA 9.0 SE | LIMDEP 8.0 | SPSS 13.0 |
| --- | --- | --- | --- | --- | --- |
| OLS | Ordinary least squares | REG | .regress | Regress$ | Regression |
| Binary | Binary logit | QLIM, GENMOD, LOGISTIC, PROBIT, CATMOD | .logit, .logistic | Logit$ | Logistic regression |
| | Binary probit | QLIM, GENMOD, LOGISTIC, PROBIT | .probit | Probit$ | Probit |
| Bivariate | Bivariate logit | QLIM | - | - | - |
| | Bivariate probit | QLIM | .biprobit | Bivariateprobit$ | - |
| Ordinal | Ordered logit | QLIM, LOGISTIC, PROBIT | .ologit | Ordered$, Logit$ | Plum |
| | Generalized logit | - | .gologit* | - | - |
| | Ordered probit | QLIM, LOGISTIC, PROBIT | .oprobit | Ordered$ | Plum |
| Nominal | Multinomial logit | CATMOD | .mlogit | Mlogit$, Logit$ | Nomreg |
| | Conditional logit | MDC, PHREG | .clogit | Clogit$, Logit$ | Coxreg |
| | Nested logit | MDC | .nlogit | Nlogit$** | - |
| | Multinomial probit | MDC | .mprobit | - | - |
* User-written commands written by Fu (1998) and Williams (2005).
** The Nlogit$ command is supported by NLOGIT 3.0, which is sold separately.
The QLIM (Qualitative and LImited dependent variable Model) procedure analyzes various categorical and limited dependent variable regression models such as censored, truncated, and sample-selection models. The procedure also handles Box-Cox regression and bivariate probit and logit models. The MDC (Multinomial Discrete Choice) procedure can estimate multinomial probit, conditional logit, and nested (multinomial) logit models.
Unlike SAS, STATA has a separate command for each CDVM. For example, the .logit and .probit commands fit the binary logit and probit models, respectively. The LIMDEP Logit$ and Probit$ commands support a variety of CDVMs that are addressed in Greene’s Econometric Analysis (2003). SPSS provides some commands for CDVMs but has limited ability to analyze categorical data. Because of this limitation, SPSS outputs are skipped here. Table 2 summarizes the procedures and commands for CDVMs.
1.4 Long and Freese's SPost Module
STATA users may take advantage of user-written modules such as J. Scott Long and Jeremy Freese’s SPost. The module allows researchers to conduct follow-up analyses of various CDVMs, including event count data models. See section 2.2 for major SPost commands.
In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long’s Web site at
http://www.indiana.edu/~jslsoc/spost_install.htm.
. net from http://www.indiana.edu/~jslsoc/stata/
. net install spost9_ado, replace
. net get spost9_do, replace
If you want to use Vincent Kang Fu’s gologit (2000) and Richard Williams’ gologit2 (2005) for the generalized ordered logit model, type in the following.
. net search gologit
. net install gologit, from(http://www.stata.com/users/jhardin)
. net install gologit2, from(http://fmwww.bc.edu/RePEc/bocode/g)