header
Contact Us
Back to Stat/Math home

Regression Models for Event Count Data Using SAS, STATA, and LIMDEP

Hun Myoung Park
Software Consultant, UITS Center for Statistical and Mathematical Computing
Contact: kucc625 at indiana.edu
Last Updated on September 15, 2005

Table of Contents (See the PDF format)


This document summarizes regression models for event count data and illustrates how to estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS 9.1, STATA 9.0, and LIMDEP 8.0.

1. Introduction

An event count is the realization of a nonnegative integer-valued random variable (Cameron and Trivedi 1998). Examples are the number of car accidents per month, thunder storms per year, and wild fires per year. The ordinary least squares (OLS) method for event count data results in biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed various nonlinear models that are based on the Poisson distribution and negative binomial distribution.

1.1 Count Data Regression Models

The left-hand side (LHS) of the equation has event count data. Independent variables are, as in the OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, or binary (dummy). Table 1 below summarizes the categorical dependent variable regression models (CDVMs) according to the level of measurement of the dependent variable.

Table 1. Comparison between OLS and CDVMs

   Model Dependent (LHS) Method Independent (RHS)
 OLS Ordinary least squares Interval or ratio scale  Moment based method  A linear function of interval/ratio or binary independent variables
 CDVMs  Binary response Binary (0 or 1)  Maximum Likelihood Method
Ordinal response Ordinal (1st, 2nd, ...)
Nominal response Nominal (A, B, ...)
Event count data Count (0, 1, 2, ...)

The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basic models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated negative binomial regression model (ZINB) is used when there are many zero counts. Other count models are developed to handle censored, truncated, or sample selected count data. This document, however, focuses on PRM, NBRM, ZIP, and ZINB.

1.2 Poisson Models versus Negative Binomial Models

The Poisson probability distribution, , has the same mean and variance (equidispersion), Var(y)=E(y)=mu. As the mean of a Poisson distribution increases, the probability of zeros decreases and the distribution approximates a normal distribution (Figure 1). The Poisson distribution also has the strong assumption that events are independent. Thus, this distribution does not fit well if differs across observations (heterogeneity) (Long 1997).

The Poisson regression model (PRM) incorporates observed heterogeneity into the Poisson distribution function, Var(y|x)=E(y|x)=mu=exp(xb). As mu increases, the conditional variance of y increases, the proportion of predicted zeros decreases, and the distribution around the expected value becomes approximately normal (Long 1997). The conditional mean of the errors is zero, but the variance of the errors is a function of independent variables, var(y|x)=exp(xb). The errors are heteroscedastic. Thus, the PRM rarely fits in practice due to overdispersion (Long 1997; Maddala 1983).

Figure 1. Poisson Probability Distribution with Means of .5, 1, 2, and 5

The negative binomial probability distribution is , where 1/v=alpha determines the degree of dispersion and the Gamma is the Gamma probability distribution. As the dispersion parameter alpha increases, the variance of the negative binomial distribution also increases, Var(y|x)=mu(1+mu/v).

The negative binomial regression model (NBRM) incorporates observed and unobserved heterogeneity into the conditional mean, mu=exp(xb+e) (Long 1997). Thus, the conditional variance of y becomes larger than its conditional mean, E(y|x)=mu, which remains unchanged. Figure 2 illustrates how the probabilities for small and larger counts increase in the negative binomial distribution as the conditional variance of y increases, given mu=2.

Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5

The PRM and NBRM, however, have the same mean structure. If , the NBRM reduces to the PRM (Cameron and Trivedi 1998; Long 1997).

1.3 Overdispersion

When Var(y|x) > E(y|x), we are said to have overdispersion. Estimates of a PRM for overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron and Trivedi 1998; Long 1997). The likelihood ratio test for overdispersion examines the null hypothesis of alpha=0. The LR statistic follows the Chi-squared distribution with one degree of freedom. If the null hypothesis is rejected, NBRM is preferred to PRM.

Zero-inflated models handle overdispersion by changing the mean structure to explicitly model the production of zero counts (Long 1997). These models assume two latent groups. One is the always-zero group and the other is not-always-zero or sometime-zero group (Long 1997). Thus, zero counts come from the former group and some of the latter group with a certain probability.

The likelihood ratio tests the null hypothesis of alpha=0 to compare the ZIP and NBRM. The PRM and ZIP, and NBRM and ZINB cannot, however, be tested by this likelihood ratio, since they are not nested respectively. The Voung’s statistic compares these non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96, the PRM or NBRM is preferred (Long 1997).

1.4 Estimation in SAS, STATA, and LIMDEP

The SAS GENMOD estimates Poisson and negative binomial regression models. STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data models. LIMDEP has Poisson$ and Negbin$ commands to estimate various count data models including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and commands for count data regression models.

Table 2. Comparison of the Procedures and Commands for Count Data Models

Model SAS 9.1 STATA 9.0 SE LIMDEP 8.0
Poisson Regression (PRM) GENMOD .poission Poisson$
Negative Binomial Regression (NBRM) GENMOD .nbreg Negbin$
Zero-infliated Poisson (ZIP) - .zip Poisson; Zip; Rh2$
Zero-Inflacted Negative Binomial (ZINB) - .zinb Negbin; Zip; Rh2$
Zero-truncated Poisson (ZTP) - .zip Poisson; Truncation$
Zero-truncated Negative Binomial (ZTNB) - .zinb Negbin; Truncation$

The example here examines how waste quotas (emps) and the strictness of policy implementation (strict) affect the frequency of waste spill accidents of plants (accident).

1.5 Long and Freese's SPost Module

STATA users may take advantages of user-written modules such as SPost written by J. Scott Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of various CDVMs including event count data models. See 2.3 for examples of major SPost commands.

In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long’s Web site at http://www.indiana.edu/~jslsoc/spost_install.htm.

. net from http://www.indiana.edu/~jslsoc/stata/

. net install spostado, replace/

. net get spostrm7/


Top

2. Poisson Regression Model (PRM)

The SAS GENMOD procedure, STATA .poisson command, and LIMDEP Poisson$ command estimate the Poisson regression model (PRM).


2.1 PRM in SAS

SAS has the GENMOD procedure for the PRM. The /DIST=POISSON option tells SAS to use the Poisson distribution.


PROC GENMOD DATA = masil.accident;
   MODEL accident=emps strict /DIST=POISSON LINK=LOG;
RUN;

                                      The GENMOD Procedure
 
                                       Model Information
 
                               Data Set              COUNT.WASTE
                               Distribution              Poisson
                               Link Function                 Log
                               Dependent Variable       Accident
                               Observations Used             778
 
 
                             Criteria For Assessing Goodness Of Fit
 
                  Criterion                 DF           Value        Value/DF
 
                  Deviance                 775       2827.2079          3.6480
                  Scaled Deviance          775       2827.2079          3.6480
                  Pearson Chi-Square       775       4944.9473          6.3806
                  Scaled Pearson X2        775       4944.9473          6.3806
                  Log Likelihood                     -667.2291
 
          Algorithm converged.
 
 
                                Analysis Of Parameter Estimates
 
                                   Standard     Wald 95% Confidence       Chi-
    Parameter    DF    Estimate       Error           Limits            Square    Pr > ChiSq
 
    Intercept     1      0.3901      0.0467      0.2986      0.4816      69.84        <.0001
    Emps          1      0.0054      0.0007      0.0040      0.0069      53.13        <.0001
    Strict        1     -0.7042      0.0668     -0.8350     -0.5733     111.25        <.0001
    Scale         0      1.0000      0.0000      1.0000      1.0000
 
NOTE: The scale parameter was held fixed.

You will need to run a restricted model without regressors in order to conduct the likelihood ratio test for goodness-of-fit. The chi-squared statistic is 124.8218 = 2* [-667.2291 - (-729.6400)] (p<.0000).

PROC GENMOD DATA = masil.accident;
   MODEL accident= /DIST=POISSON LINK=LOG;
RUN;

                                      The GENMOD Procedure
 
                                       Model Information
 
                              Data Set              MASIL.ACCIDENT
                              Distribution                 Poisson
                              Link Function                    Log
                              Dependent Variable          accident
 
 
                            Number of Observations Read         778
                            Number of Observations Used         778
 
 
                             Criteria For Assessing Goodness Of Fit
 
                  Criterion                 DF           Value        Value/DF
 
                  Deviance                 777       2952.0297          3.7993
                  Scaled Deviance          777       2952.0297          3.7993
                  Pearson Chi-Square       777       4919.9745          6.3320
                  Scaled Pearson X2        777       4919.9745          6.3320
                  Log Likelihood                     -729.6400
 
          Algorithm converged.
 
 
                                Analysis Of Parameter Estimates
 
                                   Standard     Wald 95% Confidence       Chi-
    Parameter    DF    Estimate       Error           Limits            Square    Pr > ChiSq
 
    Intercept     1      0.3168      0.0306      0.2568      0.3768     107.20        <.0001
    Scale         0      1.0000      0.0000      1.0000      1.0000
 
NOTE: The scale parameter was held fixed.

Top

2.2 PRM in STATA

STATA has the .poisson command for the PRM. This command provides likelihood ratio and Pseudo R2 statistics.

. poisson accident emps strict

Iteration 0:   log likelihood = -1821.5112 
Iteration 1:   log likelihood = -1821.5101 
Iteration 2:   log likelihood = -1821.5101 
 
Poisson regression                                Number of obs   =        778
                                                  LR chi2(2)      =     124.82
                                                  Prob > chi2     =     0.0000
Log likelihood = -1821.5101                       Pseudo R2       =     0.0331
 
------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0054186   .0007434     7.29   0.000     .0039615    .0068757
      strict |  -.7041664   .0667619   -10.55   0.000    -.8350174   -.5733154
       _cons |   .3900961   .0466787     8.36   0.000     .2986076    .4815846
------------------------------------------------------------------------------

Let us run a restricted model and then run the .display command in order to double check that the likelihood ratio for goodness-of-fit is 124.8218.

. poisson accident

Iteration 0:   log likelihood =  -1883.921 
Iteration 1:   log likelihood =  -1883.921 
 
Poisson regression                                Number of obs   =        778
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood =  -1883.921                       Pseudo R2       =     0.0000
 
------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .3168165   .0305995    10.35   0.000     .2568426    .3767904
------------------------------------------------------------------------------

. display 2 * (-1821.5101 - (-1883.921))
124.8218


Top

2.3 Using the SPost Module in STATA

The SPost module provides useful commands for follow-up analyses of various categorical dependent variable models. The .fitstat command calculates various goodness-of-fit statistics such as log likelihood, McFadden’s R2 (or Pseudo R2), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).

. quietly poisson accident emps strict

. fitstat

Measures of Fit for poisson of accident
 
Log-Lik Intercept Only:    -1883.921     Log-Lik Full Model:        -1821.510
D(775):                     3643.020     LR(2):                       124.822
                                         Prob > LR:                     0.000
McFadden's R2:                 0.033     McFadden's Adj R2:             0.032
Maximum Likelihood R2:         0.148     Cragg & Uhler's R2:            0.149
AIC:                           4.690     AIC*n:                      3649.020
BIC:                       -1515.943     BIC':                       -111.508

The .listcoef command lists unstandardized coefficients (parameter estimates), factor and percent changes, and standardized coefficients to help interpret regression results.

. listcoef, help

poisson (N=778): Factor Change in Expected Count
 
 Observed SD: 2.9482675
 
----------------------------------------------------------------------
    accident |      b         z     P>|z|    e^b    e^bStdX      SDofX
-------------+--------------------------------------------------------
        emps |   0.00542    7.289   0.000   1.0054   1.2297    38.1548
      strict |  -0.70417  -10.547   0.000   0.4945   0.7031     0.5003
----------------------------------------------------------------------
       b = raw coefficient
       z = z-score for test of b=0
   P>|z| = p-value for z-test
     e^b = exp(b) = factor change in expected count for unit increase in X
 e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X
   SDofX = standard deviation of X

The .prtab command constructs a table of predicted values (events) for all combinations of categorical variables listed. The following example shows that the predicted number of accidents under the strict policy is .9172 at the mean waste quota (emps=42.0129).

. prtab strict

poisson: Predicted rates for accident
 
----------------------
   strict | Prediction
----------+-----------
        0 |     1.8547
        1 |     0.9172
----------------------
 
         emps     strict
x=  42.012853  .50771208

The .prvalue lists predicted values for a given set of values for the independent variables. For example, the predicted probability of a zero count is .3996 at the mean waste quota under the strict policy (strict=1). Note that the predicted rate of .917 is equivalent to .9172 in the .prtab above.

. prvalue, x(strict=1) maxcnt(5)

poisson: Predictions for accident
 
Predicted rate: .917     95% CI [.827   ,   1.02]
 
Predicted probabilities:
 
  Pr(y=0|x):   0.3996  Pr(y=1|x):   0.3665
  Pr(y=2|x):   0.1681  Pr(y=3|x):   0.0514
  Pr(y=4|x):   0.0118  Pr(y=5|x):   0.0022
 
         emps     strict
x=  42.012853          1

The most useful command is the .prchange that calculates marginal effects (changes) and discrete changes. For instance, a standard deviation increase in waste quota form its mean will increase accidents by .3841 under the lenient policy (strict=0).

. prchange, x(strict=0)

poisson: Changes in Predicted Rate for accident
 
        min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps    2.3070    0.0080    0.0101    0.3841    0.0101
strict   -0.9375   -0.9375   -1.3332   -0.6568   -1.3060
 
exp(xb):   1.8547
 
           emps   strict
    x=  42.0129        0
sd(x)=  38.1548  .500262

SPost also includes the .prgen command, which computes a series of predictions by holding all variables but one constant and allowing that variable to vary (Long and Freese 2003). These SPost commands work with most categorical and count data models such as .logit, .probit, .poisson, .nbreg, .zip, and .zinb.

Top

2.4 PRM in LIMDEP

The LIMDEP Poisson$ command estimates the PRM. LIMDEP reports log likelihoods of both the unrestricted and restricted models. Keep in mind that you must include the ONE for the intercept.

POISSON;
   Lhs=ACCIDENT;
   Rhs=ONE,EMPS,STRICT$

+---------------------------------------------+
| Poisson Regression                          |
| Maximum Likelihood Estimates                |
| Model estimated: Aug 24, 2005 at 04:56:45PM.|
| Dependent variable             ACCIDENT     |
| Weighting variable                 None     |
| Number of observations              778     |
| Iterations completed                  8     |
| Log likelihood function       -1821.510     |
| Restricted log likelihood     -1883.921     |
| Chi squared                    124.8218     |
| Degrees of freedom                    2     |
| Prob[ChiSqd > value] =         .0000000     |
| Chi- squared =  4944.94781  RsqP=  -.0051   |
| G  - squared =  2827.20794  RsqD=   .0423   |
| Overdispersion tests: g=mu(i)  :  4.720     |
| Overdispersion tests: g=mu(i)^2:  4.253     |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .3900961420   .46678663E-01    8.357   .0000
 EMPS      .5418599057E-02  .74341923E-03    7.289   .0000     42.012853
 STRICT      -.7041663804   .66761926E-01  -10.547   .0000     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors (Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.291 versus -1821.5101). This difference seems to come from the generalized linear model that the GENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that they result in the same likelihood ratio.

Top

3. Negative Binomial Regression Model (NBRM)

The SAS GENMODE procedure, STATA .nbreg command, and LIMDEP Negbin$ command estimate the negative binomial regression model (NBRM).


3.1 NBRM in SAS

The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that the dispersion parameter is equivalent to the alpha in STATA and LIMDEP.

PROC GENMOD DATA = masil.accident;
   MODEL accident=emps strict /DIST=NEGBIN LINK=LOG;
RUN;

                                      The GENMOD Procedure
 
                                       Model Information
 
                            Data Set                    COUNT.WASTE
                            Distribution          Negative Binomial
                            Link Function                       Log
                            Dependent Variable             Accident
                            Observations Used                   778
 
 
                             Criteria For Assessing Goodness Of Fit
 
                  Criterion                 DF           Value        Value/DF
 
                  Deviance                 775        589.7752          0.7610
                  Scaled Deviance          775        589.7752          0.7610
                  Pearson Chi-Square       775        845.6033          1.0911
                  Scaled Pearson X2        775        845.6033          1.0911
                  Log Likelihood                       37.5628
 
 
          Algorithm converged.
 
                                Analysis Of Parameter Estimates
 
                                   Standard     Wald 95% Confidence       Chi-
   Parameter     DF    Estimate       Error           Limits            Square    Pr > ChiSq
 
   Intercept      1      0.3851      0.1278      0.1345      0.6357       9.07        0.0026
   Emps           1      0.0052      0.0023      0.0008      0.0096       5.29        0.0214
   Strict         1     -0.6703      0.1671     -0.9978     -0.3427      16.09        <.0001
   Dispersion     1      3.9554      0.3501      3.3254      4.7048
 
NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

The restricted model produces a log likelihood of 28.8627. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * (37.5628 – 28.8627) (p<.00017).

PROC GENMOD DATA = masil.accident;
   MODEL accident= /DIST=NEGBIN LINK=LOG;
RUN;

The likelihood ratio for overdispersion is 1409.5838 = 2 * (37.5628 - (-667.2291)).

Top

3.2 NBRM in STATA

STATA has the .nbreg command for the NBRM. The command reports three log likelihood statistics: for the PRM, restricted NBRM (constant-only model), and unrestricted NBRM (full model), which make it easy to conduct likelihood ratio tests.

. nbreg accident emps strict

Fitting comparison Poisson model:
 
Iteration 0:   log likelihood = -1821.5112 
Iteration 1:   log likelihood = -1821.5101 
Iteration 2:   log likelihood = -1821.5101 
 
Fitting constant-only model:
 
Iteration 0:   log likelihood = -1256.6761 
Iteration 1:   log likelihood = -1152.6155 
Iteration 2:   log likelihood = -1125.6643 
Iteration 3:   log likelihood = -1125.4183 
Iteration 4:   log likelihood = -1125.4183 
 
Fitting full model:
 
Iteration 0:   log likelihood = -1117.1731 
Iteration 1:   log likelihood = -1116.7201 
Iteration 2:   log likelihood = -1116.7182 
Iteration 3:   log likelihood = -1116.7182 
 
Negative binomial regression                      Number of obs   =        778
                                                  LR chi2(2)      =      17.40
                                                  Prob > chi2     =     0.0002
Log likelihood = -1116.7182                       Pseudo R2       =     0.0077
 
------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0051981   .0022595     2.30   0.021     .0007694    .0096267
      strict |  -.6702548   .1671191    -4.01   0.000    -.9978021   -.3427074
       _cons |   .3851111   .1278468     3.01   0.003      .134536    .6356861
-------------+----------------------------------------------------------------
    /lnalpha |    1.37509   .0885176                      1.201599    1.548582
-------------+----------------------------------------------------------------
       alpha |   3.955434   .3501257                       3.32543    4.704793
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 1409.58 Prob>=chibar2 = 0.000

The restricted model or “constant-only model” gives us a log likelihood -1125.4183. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p<.00017). The p-value is computed as follows (Note the .disp or .di is an abbreviation of the .display).

. disp chi2tail(2, 17.4002)
.00016657

The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p<.0000) and rejects the null hypothesis of alpha=0. The statistically significant evidence of overdispersion indicates that the NBRM is preferred to the PRM.

. di 2 * (-1116.7182 - (-1821.5101))
1409.5838

The p-value of the likelihood ratio for overdispersion is computed as,

. di chi2tail(1, 1409.5838)
1.74e-308

Now, let us calculate marginal effects (or changes) at the means of independent variables. You should the read the discrete change labeled “0->1” of a binary variable strict, since its marginal change at the mean (.5077) is meaningless.

. prchange

nbreg: Changes in Predicted Rate for accident
 
        min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps    1.5326    0.0055    0.0068    0.2585    0.0068
strict   -0.8931   -0.8931   -0.8885   -0.4383   -0.8721
 
exp(xb):   1.3011
 
           emps   strict
    x=  42.0129  .507712
sd(x)=  38.1548  .500262

Top


3.3 NBRM in LIMDEP

LIMDEP has the Negbin$ command for the NBRM that reports the PRM as well. Note that the standard errors of parameter estimates are slightly different from those of SAS and STATA. The Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean of independent variables. You may not omit the Means$ subcommand.

NEGBIN;
   Lhs=ACCIDENT;
   Rhs=ONE,EMPS,STRICT;
   Marginal Effects;
   Means$

+---------------------------------------------+
| Poisson Regression                          |
| Maximum Likelihood Estimates                |
| Model estimated: Sep 08, 2005 at 09:35:36AM.|
| Dependent variable             ACCIDENT     |
| Weighting variable                 None     |
| Number of observations              778     |
| Iterations completed                  8     |
| Log likelihood function       -1821.510     |
| Restricted log likelihood     -1883.921     |
| Chi squared                    124.8218     |
| Degrees of freedom                    2     |
| Prob[ChiSqd > value] =         .0000000     |
| Chi- squared =  4944.94781  RsqP=  -.0051   |
| G  - squared =  2827.20794  RsqD=   .0423   |
| Overdispersion tests: g=mu(i)  :  4.720     |
| Overdispersion tests: g=mu(i)^2:  4.253     |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .3900961420   .46678663E-01    8.357   .0000
 EMPS      .5418599057E-02  .74341923E-03    7.289   .0000     42.012853
 STRICT      -.7041663804   .66761926E-01  -10.547   .0000     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
 
Normal exit from iterations. Exit status=0.
 
+---------------------------------------------+
| Negative Binomial Regression                |
| Maximum Likelihood Estimates                |
| Model estimated: Sep 08, 2005 at 09:35:36AM.|
| Dependent variable             ACCIDENT     |
| Weighting variable                 None     |
| Number of observations              778     |
| Iterations completed                  8     |
| Log likelihood function       -1116.718     |
| Restricted log likelihood     -1821.510     |
| Chi squared                    1409.584     |
| Degrees of freedom                    1     |
| Prob[ChiSqd > value] =         .0000000     |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .3851110699       .12855240    2.996   .0027
 EMPS      .5198057234E-02  .22602075E-02    2.300   .0215     42.012853
 STRICT      -.6702547660       .16729839   -4.006   .0001     .50771208
          Dispersion parameter for count data model
 Alpha        3.955434012       .35680876   11.086   .0000
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
 
 
+-------------------------------------------+
| Partial derivatives of expected val. with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs.  |
| Conditional Mean at Sample Point   1.3011 |
| Scale Factor for Marginal Effects  1.3011 |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .5010628939       .19396434    2.583   .0098
 EMPS      .6763123170E-02  .29746591E-02    2.274   .0230     42.012853
 STRICT      -.8720595665       .22469308   -3.881   .0001     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Read the coefficients (.0068 and -.8721) to confirm that they are identical to the corresponding marginal effects calculated in STATA.

SAS, STATA, and LIMDEP produce almost the same parameter estimates and goodness-of-fit statistics (Table 4). Note that SAS reports different log likelihoods, but the same likelihood ratio.

Figure 3 compares the PRM and NBRM. Look at the predictions for zero counts of the two models. As the likelihood ratio test indicates, the NBRM seems to fit these data better than PRM.

Figure 3. Comparison of the Poisson and Negative Binomial Regression Models



Top

4. Zero-inflated Poission Regression Model (ZIP)

STATA and LIMDEP have commands for the zero-inflated Poisson regression model (ZIP).

4.1 ZIP in STATA

STATA has the .zip command to estimate the ZIP. The inflate() option specifies a list of variables that determines whether the observed count is zero. The vuong option computes the Vuong statistic to compare the ZIP and PRM.

. zip accident emps strict, inflate(emps strict) vuong

Fitting constant-only model:
 
Iteration 0:   log likelihood = -1627.0779 
Iteration 1:   log likelihood = -1309.5825 
Iteration 2:   log likelihood =  -1272.433 
Iteration 3:   log likelihood = -1270.9543 
Iteration 4:   log likelihood = -1270.9523 
Iteration 5:   log likelihood = -1270.9523 
 
Fitting full model:
 
Iteration 0:   log likelihood = -1270.9523 
Iteration 1:   log likelihood = -1269.7219 
Iteration 2:   log likelihood = -1269.7206 
Iteration 3:   log likelihood = -1269.7206 
 
Zero-inflated Poisson regression                  Number of obs   =        778
                                                  Nonzero obs     =        280
                                                  Zero obs        =        498
 
Inflation model = logit                           LR chi2(2)      =       2.46
Log likelihood  = -1269.721                       Prob > chi2     =     0.2918
 
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident     |
        emps |   -.000277   .0008633    -0.32   0.748     -.001969     .001415
      strict |  -.0923911   .0729023    -1.27   0.205    -.2352771    .0504948
       _cons |   1.361978   .0493222    27.61   0.000     1.265308    1.458647
-------------+----------------------------------------------------------------
inflate      |
        emps |  -.0109897   .0022678    -4.85   0.000    -.0154344    -.006545
      strict |   1.057031   .1767509     5.98   0.000     .7106059    1.403457
       _cons |    .488656   .1211099     4.03   0.000     .2512849     .726027
------------------------------------------------------------------------------
Vuong test of zip vs. standard Poisson:            z =     8.40  Pr>z = 0.0000

The restricted model is estimated with the intercept only.

. zip accident, inflate(emps strict)

The Vuong statistic at the bottom compares the ZIP and PRM. Since the V 8.40 is greater than 1.96, we conclude that the ZIP is preferred to the PRM.

Top

4.2 ZIP in LIMDEP

The LIMDEP Poisson$ command needs to have the Zip and Rh2 subcommands. The Rh2 is equivalent to the inflate() option in STATA. The Alg=Newton$ subcommand is needed to use the Newton-Raphson algorithm because the default Broyden algorithm failed to converge.

POISSON;
   Lhs=ACCIDENT;
   Rhs=ONE,EMPS,STRICT;
   Rh2=ONE,EMPS,STRICT;
   ZIP; Alg=Newton$

+---------------------------------------------+
| Poisson Regression                          |
| Maximum Likelihood Estimates                |
| Model estimated: Sep 06, 2005 at 00:25:07PM.|
| Dependent variable             ACCIDENT     |
| Weighting variable                 None     |
| Number of observations              778     |
| Iterations completed                  8     |
| Log likelihood function       -1821.510     |
| Restricted log likelihood     -1883.921     |
| Chi squared                    124.8218     |
| Degrees of freedom                    2     |
| Prob[ChiSqd > value] =         .0000000     |
| Chi- squared =  4944.94781  RsqP=  -.0051   |
| G  - squared =  2827.20794  RsqD=   .0423   |
| Overdispersion tests: g=mu(i)  :  4.720     |
| Overdispersion tests: g=mu(i)^2:  4.253     |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .3900961420   .46678663E-01    8.357   .0000
 EMPS      .5418599057E-02  .74341923E-03    7.289   .0000     42.012853
 STRICT      -.7041663804   .66761926E-01  -10.547   .0000     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
 
 
Normal exit from iterations. Exit status=0.
 
 +----------------------------------------------------------------------+
 | Zero Altered Poisson      Regression Model                           |
 | Logistic distribution used for splitting model.                      |
 | ZAP term in probability is F[tau x Z(i)     ]                        |
 | Comparison of estimated models                                       |
 |             Pr[0|means]       Number of zeros        Log-likelihood  |
 | Poisson          .27329   Act.=   498 Prd.=   212.6     -1821.51007  |
 | Z.I.Poisson      .64642   Act.=   498 Prd.=   502.9     -1259.88568  |
 | Note, the ZIP log-likelihood is not directly comparable.             |
 | ZIP model with nonzero Q does not encompass the others.              |
 | Vuong statistic for testing ZIP vs. unaltered model is      9.5740   |
 | Distributed as standard normal. A value greater than                 |
 | +1.96 favors the zero altered Z.I.Poisson model.                     |
 | A value less than -1.96 rejects the ZIP model.                       |
 +----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
          Poisson/NB/Gamma regression model
 Constant     1.361977491   .23944641E-01   56.880   .0000
 EMPS     -.2770010575E-03  .37770090E-03    -.733   .4633     42.012853
 STRICT   -.9239125073E-01  .33326502E-01   -2.772   .0056     .50771208
          Zero inflation model
 Constant     .4886559537       .12210013    4.002   .0001
 EMPS     -.1098971050E-01  .22152492E-02   -4.961   .0000     42.012853
 STRICT       1.057031399       .17715551    5.967   .0000     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

In order to estimate the restricted model, run the following command with the ONE only in the Lhs$ subcommand. The Rh2$ subcommand remains unchanged.

POISSON;
   Lhs=ACCIDENT;
   Rhs=ONE;
   Rh2=ONE,EMPS,STRICT;
   ZIP; Alg=Newton$

STATA and LIMDEP report the same parameter estimates, but they produce different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious log likelihood for the restricted model, and thus ended up with the “unlikely” likelihood ratio of -.0304. In addition, the Vuong statistics in STATA and LIMDEP are different.

Top

5. Zero-inflated Negative Binomial Regression Model (ZINB)

STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB).

5.1 ZINB in STATA

The STATA .zinb command estimates the ZINB. The vuong option computes the Vuong statistic to compare the ZINB and NBRM.

. zinb accident emps strict, inflate(emps strict) vuong

Fitting constant-only model:
 
Iteration 0:   log likelihood = -1190.5117  (not concave)
Iteration 1:   log likelihood = -1106.9874 
Iteration 2:   log likelihood = -1098.8642 
Iteration 3:   log likelihood = -1095.3638 
Iteration 4:   log likelihood = -1094.0237 
Iteration 5:   log likelihood =  -1093.063 
Iteration 6:   log likelihood = -1092.6216 
Iteration 7:   log likelihood =  -1091.798 
Iteration 8:   log likelihood = -1091.7332 
Iteration 9:   log likelihood = -1091.7329 
Iteration 10:  log likelihood = -1091.7329 
 
Fitting full model:
 
Iteration 0:   log likelihood = -1091.7329 
Iteration 1:   log likelihood = -1089.5565 
Iteration 2:   log likelihood = -1089.5198 
Iteration 3:   log likelihood = -1089.5198 
 
Zero-inflated negative binomial regression        Number of obs   =        778
                                                  Nonzero obs     =        280
                                                  Zero obs        =        498
 
Inflation model = logit                           LR chi2(2)      =       4.43
Log likelihood  =  -1089.52                       Prob > chi2     =     0.1094
 
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident     |
        emps |  -.0004407   .0020554    -0.21   0.830    -.0044691    .0035877
      strict |  -.3251317   .1659173    -1.96   0.050    -.6503235    .0000602
       _cons |   .7763065   .1508037     5.15   0.000     .4807367    1.071876
-------------+----------------------------------------------------------------
inflate      |
        emps |  -.2087768   .0955122    -2.19   0.029    -.3959772   -.0215763
      strict |   7.562388   3.055775     2.47   0.013     1.573179     13.5516
       _cons |   .1032115   .3800045     0.27   0.786    -.6415835    .8480065
-------------+----------------------------------------------------------------
    /lnalpha |   .9252514   .1351387     6.85   0.000     .6603845    1.190118
-------------+----------------------------------------------------------------
       alpha |   2.522502   .3408876                      1.935536     3.28747
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z =     4.13  Pr>z = 0.0000

The likelihood ratio, 360.4024= 2*(-1089.5198 - (-1269.721)), rejects the null hypothesis of no overdispersion, indicating that the ZINB can improve goodness-of-fit over the ZIP (p<.0000). The Vuong test, 4.13 > 1.96, suggests that the ZINB is preferred to the NBRM.

Top

5.2 ZINB in LIMDEP

The LIMDEP Negbin$ command needs to have the Zip and Rh2 subcommands for the ZINB. The following command produces the Poisson regression model, negative binomial model, and zero-inflated negative binomial model. You may omit the Alg=Newton$ subcommand.

NEGBIN;
   Lhs=ACCIDENT;
   Rhs=ONE,EMPS,STRICT;
   Rh2=ONE,EMPS,STRICT;
   ZIP; Alg=Newton$

+---------------------------------------------+
| Poisson Regression                          |
| Maximum Likelihood Estimates                |
| Model estimated: Sep 10, 2005 at 00:20:00AM.|
| Dependent variable             ACCIDENT     |
| Weighting variable                 None     |
| Number of observations              778     |
| Iterations completed                  8     |
| Log likelihood function       -1821.510     |
| Restricted log likelihood     -1883.921     |
| Chi squared                    124.8218     |
| Degrees of freedom                    2     |
| Prob[ChiSqd > value] =         .0000000     |
| Chi- squared =  4944.94781  RsqP=  -.0051   |
| G  - squared =  2827.20794  RsqD=   .0423   |
| Overdispersion tests: g=mu(i)  :  4.720     |
| Overdispersion tests: g=mu(i)^2:  4.253     |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .3900961420   .46678663E-01    8.357   .0000
 EMPS      .5418599057E-02  .74341923E-03    7.289   .0000     42.012853
 STRICT      -.7041663804   .66761926E-01  -10.547   .0000     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
 
Normal exit from iterations. Exit status=0.
 
+---------------------------------------------+
| Negative Binomial Regression                |
| Maximum Likelihood Estimates                |
| Model estimated: Sep 10, 2005 at 00:20:00AM.|
| Dependent variable             ACCIDENT     |
| Weighting variable                 None     |
| Number of observations              778     |
| Iterations completed                 12     |
| Log likelihood function       -1116.718     |
| Restricted log likelihood     -1821.510     |
| Chi squared                    1409.584     |
| Degrees of freedom                    1     |
| Prob[ChiSqd > value] =         .0000000     |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
 Constant     .3851110482       .12855240    2.996   .0027
 EMPS      .5198057322E-02  .22602075E-02    2.300   .0215     42.012853
 STRICT      -.6702547787       .16729839   -4.006   .0001     .50771208
          Dispersion parameter for count data model
 Alpha        3.955434128       .35680877   11.086   .0000
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
 
Normal exit from iterations. Exit status=0.
 
 +----------------------------------------------------------------------+
 | Zero Altered Neg.Binomial Regression Model                           |
 | Logistic distribution used for splitting model.                      |
 | ZAP term in probability is F[tau x Z(i)     ]                        |
 | Comparison of estimated models                                       |
 |             Pr[0|means]       Number of zeros        Log-likelihood  |
 | Poisson          .27329   Act.=   498 Prd.=   212.6     -1821.51007  |
 | Neg. Bin.        .32470   Act.=   498 Prd.=   252.6     -1116.71820  |
 | Z.I.Neg_Bin      .62918   Act.=   498 Prd.=   489.5     -1089.51977  |
 | Note, the ZIP log-likelihood is not directly comparable.             |
 | ZIP model with nonzero Q does not encompass the others.              |
 | Vuong statistic for testing ZIP vs. unaltered model is      4.1270   |
 | Distributed as standard normal. A value greater than                 |
 | +1.96 favors the zero altered Z.I.Neg_Bin model.                     |
 | A value less than -1.96 rejects the ZIP model.                       |
 +----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient  | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
          Poisson/NB/Gamma regression model
 Constant     .7763063017       .15178042    5.115   .0000
 EMPS     -.4407244013E-03  .20262626E-02    -.218   .8278     42.012853
 STRICT      -.3251315411       .16179883   -2.009   .0445     .50771208
          Dispersion parameter
 Alpha        2.522502810       .29924002    8.430   .0000
          Zero inflation model
 Constant     .1032103951       .37413759     .276   .7827
 EMPS        -.2087767804   .68774937E-01   -3.036   .0024     42.012853
 STRICT       7.562389399       2.2216392    3.404   .0007     .50771208
 (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

In order to estimate the restricted model, run the following command. You have to use the Alg=Newton$ subcommand to get the restricted model to converge.

NEGBIN;
   Lhs=ACCIDENT;
   Rhs=ONE;
   Rh2=ONE,EMPS,STRICT;
   ZIP; Alg=Newton$

Figure 4. Comparison of the Zero-Inflated PRM and the Zero-Inflated NBRM



Top

6. Conclusion

Like other econometric models, researchers must first examine the data generation process of a dependent variable to understand its behavior. Sophisticated researchers pay special attention to excess zeros, censored and/or truncated counts, sample selection, and other particular patterns of the data generation, and then decide which model best describes the data generation process.

The Poisson regression model and negative binomial regression model have the same mean structure, but they describe the behavior of a dependent variable in different ways. Zero-inflated regression models integrate two different data generation processes to deal with overdispersion. Truncated or censored regression models are appropriate when data are (left and/or right) truncated or censored.

Researchers need to spend more time and effort interpreting the results substantively. Like other categorical dependent variable models, count data models produce estimates that are difficult to interpret intuitively. Reporting parameter estimates and goodness-of-fit statistics are not sufficient. J. Scott Long (1997) and Long and Freese (2003) provide good examples of meaningful count data model interpretations.

Regarding statistical software, I would recommend STATA for general count data models and LIMDEP for special types of models. Although able to handle various models, LIMDEP does not seem stable and reliable. The SAS GENMODE procedure estimates the Poisson regression model and the negative binomial model, but it does not have easy ways of estimating other models. We encourage SAS Institute to develop an individual procedure, say the CLIM (Count and Limited Dependent Variable Model) procedure, to handle a variety of count data models.

Top


APPENDIX: Data Set

The data set used here is a part of the data provided for David H. Good’s class in the School of Public and Environmental Affairs, Indiana University. Note that these data have been manipulated for the sake of data security. The variables in the data set include,

Download (UITS Stat/Math Center): ASCII Text | STATA | SAS

The followings summarize descriptive statistics of these variables. Note that there are many zero counts that indicate an overdispersion problem.

. summarize accident emps strict

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    accident |       778    1.372751    2.948267          0         31
        emps |       778    42.01285     38.1548          1        174
      strict |       778    .5077121    .5002621          0          1

. tab accident strict

           |        strict
  accident |         0          1 |     Total
-----------+----------------------+----------
         0 |       214        284 |       498
         1 |        41         29 |        70
         2 |        38         32 |        70
         3 |        28         13 |        41
         4 |        16         13 |        29
         5 |        10          3 |        13
         6 |        12          7 |        19
         7 |         4          3 |         7
         8 |         4          2 |         6
         9 |         3          2 |         5
        10 |         0          2 |         2
        11 |         3          1 |         4
        12 |         2          0 |         2
        13 |         1          0 |         1
        14 |         1          0 |         1
        15 |         3          0 |         3
        16 |         1          0 |         1
        17 |         0          2 |         2
        18 |         1          1 |         2
        21 |         0          1 |         1
        31 |         1          0 |         1
-----------+----------------------+----------
     Total |       383        395 |       778


Top


References



Acknowledgements

I am grateful to Jeremy Albright and Kevin Wilhite at the Center for Statistical and Mathematical Computing, Indiana University, who provided valuable comments and suggestions.


Revision History