9. Conclusion
The regression models discussed so far are for categorical dependent variables (binary, ordinal, and nominal
responses). The appropriate regression model is determined largely by the measurement level of the categorical dependent
variable of interest. The level of measurement should, however, be considered in conjunction with your theory and
research questions (Long 1997). You must also examine the data generation process (DGP) of the dependent variable to
understand its "behavior." Experienced researchers pay special attention to censoring, truncation, sample selection,
and other particular patterns of the DGP, although these limited dependent variable issues are not addressed here.
Generally speaking, if your dependent variable is binary, you may use the binary logit or probit regression
model. For ordinal responses, try fitting an ordered logit or ordered probit model. If you have a nominal
response variable, investigate the DGP carefully and then choose among the multinomial logit, conditional logit, and
nested logit models. To use the conditional logit and nested logit models, you need to reshape the data set in
advance so that each subject has one observation per alternative.
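The reshaping step can be sketched as follows. This is a minimal illustration in Python rather than SAS or Stata, assuming a hypothetical wide-format layout (one row per subject with columns named chosen and cost_air, cost_train, cost_bus, cost_car) and made-up values; only the output variable names subject, mode, choice, and cost follow the travel data set described in the appendix.

```python
# Hypothetical wide-format records: one row per subject, holding the chosen
# mode and one cost column per alternative. All values are made up.
MODES = ["air", "train", "bus", "car"]  # coded 1..4, as in the appendix

wide = [
    {"subject": 1, "chosen": "car",
     "cost_air": 70, "cost_train": 71, "cost_bus": 70, "cost_car": 30},
    {"subject": 2, "chosen": "air",
     "cost_air": 68, "cost_train": 84, "cost_bus": 85, "cost_car": 50},
]


def wide_to_long(rows, modes):
    """Return one record per subject-alternative pair."""
    long_rows = []
    for row in rows:
        for code, mode in enumerate(modes, start=1):
            long_rows.append({
                "subject": row["subject"],
                "mode": code,                                  # 1=Air ... 4=Car
                "choice": 1 if row["chosen"] == mode else 0,   # chosen alternative
                "cost": row["cost_" + mode],                   # alternative-specific
            })
    return long_rows


long_data = wide_to_long(wide, MODES)
for record in long_data:
    print(record)
```

Each subject contributes as many rows as there are alternatives, and exactly one of those rows has choice equal to 1. The travel data set already has this layout, with 840 observations.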
You should check the key assumptions of a model when fitting it. Examples are the parallel regression assumption in
ordered logit and probit models and the independence of irrelevant alternatives (IIA) assumption in the multinomial
logit model. You may test these assumptions with the Brant test and the Hausman test, respectively. If an assumption of
an ordered or nominal response model is violated, look for an alternative model or consider carefully whether the
dependent variable can be dichotomized and analyzed in a binary response model.
Since logit and probit models are nonlinear, their parameter estimates are difficult to interpret intuitively. The situation
becomes even worse in generalized ordered logit and multinomial logit models, where many parameter estimates and related
statistics are produced. Consequently, researchers need to spend more time and effort interpreting the results substantively.
Simply reporting parameter estimates and goodness-of-fit statistics is not sufficient. Long (1997) and Long and Freese
(2003) provide good examples of meaningful interpretations using predicted probabilities, factor changes in odds, and marginal
effects and discrete changes in predicted probabilities. Plotting predicted probabilities is a highly recommended way to
visualize marginal effects and discrete changes.
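The three quantities can be sketched for a binary logit model as follows. This is only an illustration in Python: the coefficients are hypothetical numbers chosen for the example, not estimates from the GSS data, and the function names are made up here rather than taken from SPost or any other package.

```python
import math

# Hypothetical binary logit coefficients (illustrative, not estimated).
COEF = {"_cons": -2.0, "educate": 0.15, "male": 0.30}


def predicted_prob(x, coef=COEF):
    """P(y = 1 | x) = 1 / (1 + exp(-xb)) for a binary logit model."""
    xb = coef["_cons"] + sum(coef[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-xb))


def discrete_change(x, var, start, end, coef=COEF):
    """Change in P(y = 1) when var moves from start to end, others held fixed."""
    at_start = dict(x, **{var: start})
    at_end = dict(x, **{var: end})
    return predicted_prob(at_end, coef) - predicted_prob(at_start, coef)


def factor_change_in_odds(var, delta=1.0, coef=COEF):
    """Multiplicative change in the odds for a delta-unit increase in var."""
    return math.exp(coef[var] * delta)


base = {"educate": 12, "male": 0}
print("P(y=1) at base:", predicted_prob(base))
print("Discrete change, male 0 -> 1:", discrete_change(base, "male", 0, 1))
print("Factor change in odds per year of education:",
      factor_change_in_odds("educate"))

# Values one would plot: predicted probability across years of education.
for years in range(8, 21, 4):
    print(years, round(predicted_prob({"educate": years, "male": 0}), 3))
```

The factor change in odds is constant across the values of the other variables, whereas the discrete change depends on where the other variables are held, which is why the values used for the prediction should always be reported.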
In general, logit and probit models require a larger N than linear regression models do. Like the Bayesian estimation method, the
maximum likelihood estimation method depends on data. You need to check whether you have enough valid observations, especially
when your data contain many missing values. J. Scott Long's rule of thumb says that ML estimation requires 500 observations plus
at least 10 more per independent variable. If you have a small N, DO NOT include a large number of independent
variables. This is the so-called "small N and large parameter" problem; you may not be able to reach convergence in estimation
(you are just torturing SAS or Stata to get nothing) and/or may not get reliable results with desirable asymptotic ML
properties. What if 10 parameters are estimated on the basis of 50 observations? By contrast, an extremely large N, say
millions of observations to estimate only two parameters, is not always a virtue since it absurdly boosts the statistical power
of a test without adding new information. Even a tiny effect, which would be negligible in a normal situation, may then be
reported as statistically significant although it is substantively unimportant.
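Both points can be illustrated with simple arithmetic. The sketch below, in Python, takes one literal reading of the rule of thumb as stated above (500 observations plus 10 per independent variable) and then shows how a fixed, tiny difference between two proportions (0.501 versus 0.500, numbers made up for the example) yields an ever larger z statistic as N grows. The function names are hypothetical.

```python
import math


def rule_of_thumb_n(n_independent_vars, base=500, per_var=10):
    """Minimum N under one literal reading of the rule of thumb above."""
    return base + per_var * n_independent_vars


def z_for_difference(p1, p2, n_per_group):
    """z statistic for a difference between two proportions,
    equal group sizes, unpooled standard error."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return (p1 - p2) / se


print("Suggested minimum N with 10 independent variables:",
      rule_of_thumb_n(10))

# The same tiny difference, tested at increasingly large sample sizes.
for n in (500, 50_000, 5_000_000):
    z = z_for_difference(0.501, 0.500, n)
    print(n, round(z, 3), "significant at .05" if abs(z) > 1.96 else "n.s.")
```

The difference of 0.001 never changes; only N does. The z statistic grows in proportion to the square root of N, so statistical significance at very large N says little about substantive importance.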
Regarding statistical software packages, I would recommend PROC LOGISTIC of SAS/STAT and PROC QLIM and PROC MDC of SAS/ETS
(see Tables 2.1 and 3.1). SAS also has PROC GENMOD and PROC PROBIT, but PROC LOGISTIC and PROC QLIM appear to be
best for binary and ordinal response models, and PROC MDC is good for nominal dependent variable models. The Output
Delivery System (ODS) is another advantage of using SAS. I also strongly recommend Stata since it provides handy ways to
fit various models and can be supplemented with SPost, which has various useful commands such as .fitstat, .prchange,
.listcoef, .prtab, and .prgen. I encourage SAS Institute to develop additional statements similar to, in particular,
.prchange and .prgen.
LIMDEP supports various regression models for categorical dependent variables addressed in Greene (2003) but does
not seem as user-friendly and stable as SAS and Stata. However, LIMDEP computes direct and indirect effects in the
recursive bivariate probit model and helps researchers interpret the result in more detail. You may benefit from
R's object-oriented programming concept and analyze data flexibly in your own way. SPSS is least recommended, mainly
due to its limited support for categorical dependent variable models and its messy syntax and output.
APPENDIX: Data Sets
The first data set is a subset of the 2000 and 2002 General Social Survey data provided by NORC.
Download: csv | SAS | Stata (.dta) | Stata script (.do) | LIMDEP (.lpj)
- trust: 1 if a respondent trusts most people
- belief: Religious intensity: no religion (0) through strong (3)
- educate: respondent's education (years)
- income: family income (in $1,000)
- age: respondent's age
- male: 1 for male and 0 for female
- www: 1 if a respondent has used the WWW
The second data set, travel, is on travel mode choice and is adapted from Greene (2003). You may get the data from http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm
Download: Travel Mode (csv) | SAS | Stata (.dta) | LIMDEP (.lpj)
- subject: identification number
- mode: 1=Air, 2=Train, 3=Bus, 4=Car
- choice: 1 if the travel mode is chosen
- time: terminal waiting time, 0 for car
- cost: generalized cost measure
- income: household income
- air_inc: interaction of air flight and household income, air*income
- air: 1 for the air flight mode, 0 for others
- train: 1 for the train mode, 0 for others
- bus: 1 for the bus mode, 0 for others
- car: 1 for the car mode, 0 for others
- failure: failure time variable, 1-choice
. tab choice mode
           |                    mode
    choice |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |       152        147        180        151 |       630
         1 |        58         63         30         59 |       210
-----------+--------------------------------------------+----------
     Total |       210        210        210        210 |       840
. sum time income air_inc
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        time |       840    34.58929    24.94861          0         99
      income |       840    34.54762    19.67604          2         72
     air_inc |       840    8.636905    17.91206          0         72
References
- Allison, Paul D. 1999. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute.
- Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.
- Cameron, A. Colin, and Pravin K. Trivedi. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press.
- Fu, V. Kang. 1998. "Estimating Generalized Ordered Logit Models," Stata Technical Bulletin, STB-44: 27-30.
- Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
- Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed. Plainview, NY: Econometric Software.
- Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using Stata, 2nd ed. College Station, TX: Stata Press.
- Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Thousand Oaks, CA: Sage Publications.
- Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
- Park, Hun Myoung. 2004. "Presenting the Binary Logit/Probit Models Using the SAS/IML." Proceedings of the 15th Midwest SAS Users Group Conference in Chicago, IL (September 26-28, 2004).
- SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
- SPSS Inc. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
- Stata Press. 2005. Stata Base Reference Manual, Release 9. College Station, TX: Stata Press.
- Stokes, Maura E., Charles S. Davis, and Gary G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.
- Williams, Richard. 2005. Gologit2: A Program for Generalized Logistic Regression/Partial Proportional Odds Models for Ordinal Dependent Variables. North American Stata Users' Group Meeting 2005.
Acknowledgements
I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments
and suggestions. I also thank J. Scott Long in Sociology and David H. Good in the School of Public and Environmental Affairs,
Indiana University. A special thanks to many readers around the world who have eagerly provided constructive feedback and
encouraged me to keep improving this document.
Revision History
- 2003. 04 First draft.
- 2004. 07 Second draft.
- 2005. 09 Third draft (Added bivariate logit/probit and nested logit models).
- 2008. 10 Fourth draft (Added SAS ODS and selected SPSS output).
- 2009. 09 Fifth draft (Estimated models using different data and rewrote chapters 5 and 6).



