9. Conclusion


The regression models discussed so far are for categorical dependent variables (binary, ordinal, and nominal responses). The appropriate regression model is determined largely by the measurement level of the categorical dependent variable of interest. The level of measurement should, however, be considered in conjunction with your theory and research questions (Long 1997). You must also examine the data generation process (DGP) of a dependent variable to understand its "behavior." Experienced researchers pay special attention to censoring, truncation, sample selection, and other particular patterns of the DGP, although these limited dependent variable issues are not addressed here.

Generally speaking, if your dependent variable is binary, you may use the binary logit or probit regression model. For ordinal responses, fit either the ordered logit or the ordered probit model. If you have a nominal response variable, investigate the DGP carefully and then choose among the multinomial logit, conditional logit, and nested logit models. To use the conditional logit and nested logit models, you need to reshape the data set in advance so that each row represents one alternative for one subject.
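As a rough map from measurement level to estimation command, the following Stata sketch uses the variables listed in the appendix. The covariate lists are illustrative assumptions, not necessarily the specifications estimated in earlier chapters, and each part assumes the relevant data set is already in memory.

```stata
* --- First appendix data set (GSS subset) in memory ---
* Binary response (trust is coded 0/1): binary logit or probit
logit  trust educate income age male www
probit trust educate income age male www

* Ordinal response (belief runs from 0 through 3): ordered logit or probit
ologit  belief educate income age male www
oprobit belief educate income age male www

* --- Second appendix data set (travel) in memory ---
* The travel data are already in long form: one row per subject and mode
* (210 subjects x 4 modes = 840 rows).

* Multinomial logit: one row per subject, so keep only the chosen mode
mlogit mode income if choice == 1, baseoutcome(4)

* Conditional logit: all rows, grouped by subject; car is the base mode
clogit choice air train bus cost time air_inc, group(subject)
```

The nested logit model (nlogit) additionally needs a variable that defines the nesting structure, so it is left out of this sketch.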

You should check the key assumptions of a model when fitting it. Examples are the parallel regression assumption in ordered logit and probit models and the independence of irrelevant alternatives (IIA) assumption in the multinomial logit model. The Brant test and the Hausman test, respectively, check these assumptions. If an assumption of an ordered or nominal response model is violated, look for alternative models or consider carefully whether the dependent variable can be dichotomized and explored in a binary response model.
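The two tests might be run in Stata roughly as follows. This is a sketch under assumptions: the covariate lists are illustrative, brant is an SPost command that must be installed separately, and dropping the bus alternative (mode 3) for the Hausman comparison is an arbitrary choice made only for the example.

```stata
* --- GSS subset in memory: parallel regression assumption ---
ologit belief educate income age male www
brant, detail                 // SPost command; run right after ologit

* --- Travel data in memory: IIA assumption ---
* Full model with all four alternatives
mlogit mode income if choice == 1, baseoutcome(4)
estimates store full

* Restricted model with one alternative (bus, mode 3) removed
mlogit mode income if choice == 1 & mode != 3, baseoutcome(4)
estimates store nobus

* Hausman test comparing the restricted and full estimates
hausman nobus full, alleqs constant
```

A significant test statistic suggests that the corresponding assumption does not hold. Repeating the Hausman comparison with a different alternative removed can give a different answer, so it is worth trying more than one.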

Since logit and probit models are nonlinear, their parameter estimates are difficult to interpret intuitively. The situation becomes even worse in generalized ordered logit and multinomial logit models, where many parameter estimates and related statistics are produced. Consequently, researchers need to spend more time and effort interpreting the results substantively. Simply reporting parameter estimates and goodness-of-fit statistics is not sufficient. Long (1997) and Long and Freese (2003) provide good examples of meaningful interpretation using predicted probabilities, factor changes in odds, and marginal effects and discrete changes in predicted probabilities. Visualizing marginal effects and discrete changes with a plot of predicted probabilities is highly recommended.
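A minimal Stata sketch of these interpretation tools, assuming the GSS subset is in memory and using an illustrative covariate list. The margins command requires Stata 11 or later and marginsplot requires Stata 12 or later; listcoef and prchange are SPost commands that must be installed separately. The range of education values in the last step is an assumption chosen only for the plot.

```stata
logit trust educate income age male www

* Factor changes in odds: redisplay the coefficients as odds ratios
logit, or
listcoef                      // SPost: factor changes in odds

* Predicted probability for each observation
predict phat, pr
summarize phat

* Marginal effects at the means of the covariates
margins, dydx(*) atmeans
prchange                      // SPost: discrete and marginal changes

* Predicted probabilities over a range of education, others at means
margins, at(educate=(8(2)20)) atmeans
marginsplot
```

The final plot shows how the predicted probability of trust changes with education while the other variables are held at their means.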

In general, logit and probit models require larger N than do linear regression models. Maximum likelihood estimation, like Bayesian estimation, relies heavily on the data, and its desirable properties are asymptotic. You need to check whether you have sufficient valid observations, especially when your data contain many missing values. Long's (1997) rule of thumb calls for 500 observations plus at least 10 more per independent variable in ML estimation. If you have small N, DO NOT include a large number of independent variables. This is the so-called "small N and large parameter" problem; you may not be able to reach convergence in estimation (you are just torturing SAS or Stata to get nothing) and/or may not get reliable results with desirable asymptotic ML properties. What if 10 parameters are estimated on the basis of 50 observations? By contrast, an extremely large N, say millions of observations to estimate only two parameters, is not always a virtue since it absurdly boosts the statistical power of a test without adding new information. Even a tiny effect, which would be negligible in a normal situation, may be reported as statistically significant even though it is substantively unimportant.
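Before fitting a model, you may want to count the observations that are complete on every model variable and compare that count with the rule of thumb. The following Stata sketch assumes the GSS subset is in memory and a model with five independent variables; the variable nmiss and the locals n and k are hypothetical names introduced for the example. Run it as a do-file so the locals stay defined.

```stata
* Number of missing values per observation across the model variables
egen nmiss = rowmiss(trust educate income age male www)

* Observations usable in estimation (no missing value on any variable)
count if nmiss == 0
local n = r(N)

* Number of independent variables in the planned model
local k = 5

display "complete cases: " `n'
display "complete cases per independent variable: " `n'/`k'
```

If the count falls well short of the rule of thumb, consider dropping independent variables rather than forcing the estimation.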

Regarding statistical software packages, I would recommend the LOGISTIC procedure of SAS/STAT and the QLIM and MDC procedures of SAS/ETS (see Tables 2.1 and 3.1). SAS also has PROC GENMOD and PROC PROBIT, but PROC LOGISTIC and PROC QLIM appear to be best for binary and ordinal response models, and PROC MDC is good for nominal dependent variable models. ODS is another advantage of using SAS. I also strongly recommend Stata since it provides handy ways to fit various models and can be extended with SPost, which has various useful commands such as fitstat, prchange, listcoef, prtab, and prgen. I encourage SAS Institute to develop additional statements similar to, in particular, prchange and prgen.

LIMDEP supports various regression models for categorical dependent variables addressed in Greene (2003) but does not seem as user-friendly and stable as SAS and Stata. However, LIMDEP computes direct and indirect effects in the recursive bivariate probit model and helps researchers interpret the results in more detail. You may also benefit from R's object-oriented programming concept and analyze data flexibly in your own way. SPSS is least recommended, mainly due to its limited support for categorical dependent variable models and its messy syntax and output.


APPENDIX: Data Sets


The first data set is a subset of the 2000 and 2002 General Social Survey data provided by NORC.

Download: csv | SAS | Stata (.dta) | Stata script (.do) | LIMDEP (.lpj)

  • trust: 1 if a respondent trusts most people
  • belief: religious intensity, from no religion (0) through strong (3)
  • educate: respondent's education (years)
  • income: family income ($1,000.00)
  • age: respondent's age
  • male: 1 for male and 0 for female
  • www: 1 if a respondent has used the WWW

The second data set, travel, on travel mode choice is adapted from Greene (2003). You may get the data from http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm

Download: Travel Mode (csv) | SAS | Stata (.dta) | LIMDEP (.lpj)
  • subject: identification number
  • mode: 1=Air, 2=Train, 3=Bus, 4=Car
  • choice: 1 if the travel mode is chosen
  • time: terminal waiting time, 0 for car
  • cost: generalized cost measure
  • income: household income
  • air_inc: interaction of air flight and household income, air*income
  • air: 1 for the air flight mode, 0 for others
  • train: 1 for the train mode, 0 for others
  • bus: 1 for the bus mode, 0 for others
  • car: 1 for the car mode, 0 for others
  • failure: failure time variable, 1-choice
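The derived variables in this list follow directly from mode, choice, and income. A minimal Stata sketch, assuming a copy of the travel data that holds only subject, mode, choice, time, cost, and income (in the downloadable files these variables already exist, so generate would fail unless they are dropped first):

```stata
* Alternative-specific dummies from the mode codes (1=Air ... 4=Car)
generate air   = (mode == 1)
generate train = (mode == 2)
generate bus   = (mode == 3)
generate car   = (mode == 4)

* Interaction of the air dummy with household income
generate air_inc = air*income

* Failure time variable: 0 for the chosen mode, 1 otherwise
generate failure = 1 - choice
```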

. tab choice mode

           |                    mode
    choice |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |       152        147        180        151 |       630
         1 |        58         63         30         59 |       210
-----------+--------------------------------------------+----------
     Total |       210        210        210        210 |       840

. sum time income air_inc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        time |       840    34.58929    24.94861          0         99
      income |       840    34.54762    19.67604          2         72
     air_inc |       840    8.636905    17.91206          0         72


References

  • Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute.
  • Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.
  • Cameron, A. Colin, and Pravin K. Trivedi. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press.
  • Fu, V. Kang. 1998. "Estimating Generalized Ordered Logit Models," Stata Technical Bulletin, STB-44: 27-30.
  • Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
  • Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed. Plainview, New York: Econometric Software.
  • Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using Stata, 2nd ed. College Station, TX: Stata Press.
  • Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Sage Publications.
  • Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
  • Park, Hun Myoung. 2004. "Presenting the Binary Logit/Probit Models Using the SAS/IML." Proceedings of the 15th Midwest SAS Users Group Conference in Chicago, IL (September 26-28, 2004).
  • SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
  • SPSS Inc. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
  • Stata Press. 2005. Stata Base Reference Manual, Release 9. College Station, TX: Stata Press.
  • Stokes, Maura E., Charles S. Davis, and Gary G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.
  • Williams, Richard. 2005. Gologit2: A Program for Generalized Logistic Regression/Partial Proportional Odds Models for Ordinal Dependent Variables. North American Stata Users' Groups Meeting 2005.


Acknowledgements


I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions. I also thank J. Scott Long in Sociology and David H. Good in the School of Public and Environmental Affairs, Indiana University. A special thanks to many readers around the world who have eagerly provided constructive feedback and encouraged me to keep improving this document.


Revision History

  • 2003. 04 First draft.
  • 2004. 07 Second draft.
  • 2005. 09 Third draft (Added bivariate logit/probit and nested logit models).
  • 2008. 10 Fourth draft (Added SAS ODS and selected SPSS output).
  • 2009. 09 Fifth draft (Estimated models using different data and rewrote chapters 5 and 6).

