Regression Models for Event Count Data Using SAS, STATA, and LIMDEP
Hun Myoung Park
Software Consultant, UITS Center for Statistical and Mathematical Computing
Contact: kucc625 at indiana.edu
Last Updated on September 15, 2005
Table of Contents
(See the PDF format)
- Introduction
- Poisson Regression Model (PRM)
- Negative Binomial Regression Models (NBRM)
- Zero-Inflated Poisson Regression Model (ZIP)
- Zero-Inflated Negative Binomial Regression Model (ZINB)
- Conclusion
- Appendix
- References
This document summarizes regression models for event count data and illustrates how to estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS 9.1, STATA 9.0, and LIMDEP 8.0.
An event count is the realization of a nonnegative integer-valued random variable (Cameron and Trivedi 1998). Examples are the number of car accidents per month, thunder storms per year, and wild fires per year. The ordinary least squares (OLS) method for event count data results in biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed various nonlinear models that are based on the Poisson distribution and negative binomial distribution.
1.1 Count Data Regression Models
The left-hand side (LHS) of the equation has event count data. Independent variables are, as in the OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, or binary (dummy). Table 1 below summarizes the categorical dependent variable regression models (CDVMs) according to the level of measurement of the dependent variable.
Table 1. Comparison between OLS and CDVMs
| |
Model |
Dependent (LHS) |
Method |
Independent (RHS) |
| OLS |
Ordinary least squares |
Interval or ratio scale |
Moment based method |
A linear function of interval/ratio or binary independent variables |
| CDVMs |
Binary response |
Binary (0 or 1) |
Maximum Likelihood Method |
| Ordinal response |
Ordinal (1st, 2nd, ...) |
| Nominal response |
Nominal (A, B, ...) |
| Event count data |
Count (0, 1, 2, ...) |
The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basic models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated negative binomial regression model (ZINB) is used when there are many zero counts. Other count models are developed to handle censored, truncated, or sample selected count data. This document, however, focuses on PRM, NBRM, ZIP, and ZINB.
1.2 Poisson Models versus Negative Binomial Models
The Poisson probability distribution,

, has the same mean and variance (equidispersion), Var(y)=E(y)=mu. As the mean of a Poisson distribution increases, the probability of zeros decreases and the distribution approximates a normal distribution (Figure 1). The Poisson distribution also has the strong assumption that events are independent. Thus, this distribution does not fit well if differs across observations (heterogeneity) (Long 1997).
The Poisson regression model (PRM) incorporates observed heterogeneity into the Poisson distribution function, Var(y|x)=E(y|x)=mu=exp(xb). As mu increases, the conditional variance of y increases, the proportion of predicted zeros decreases, and the distribution around the expected value becomes approximately normal (Long 1997). The conditional mean of the errors is zero, but the variance of the errors is a function of independent variables, var(y|x)=exp(xb). The errors are heteroscedastic. Thus, the PRM rarely fits in practice due to overdispersion (Long 1997; Maddala 1983).
Figure 1. Poisson Probability Distribution with Means of .5, 1, 2, and 5
The negative binomial probability distribution is

, where 1/v=alpha determines the degree of dispersion and the Gamma is the Gamma probability distribution. As the dispersion parameter alpha increases, the variance of the negative binomial distribution also increases, Var(y|x)=mu(1+mu/v).
The negative binomial regression model (NBRM) incorporates observed and unobserved heterogeneity into the conditional mean, mu=exp(xb+e) (Long 1997). Thus, the conditional variance of y becomes larger than its conditional mean, E(y|x)=mu, which remains unchanged. Figure 2 illustrates how the probabilities for small and larger counts increase in the negative binomial distribution as the conditional variance of y increases, given mu=2.
Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5
The PRM and NBRM, however, have the same mean structure. If , the NBRM reduces to the PRM (Cameron and Trivedi 1998; Long 1997).
1.3 Overdispersion
When Var(y|x) > E(y|x), we are said to have overdispersion. Estimates of a PRM for overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron and Trivedi 1998; Long 1997). The likelihood ratio test for overdispersion examines the null hypothesis of alpha=0. The LR statistic follows the Chi-squared distribution with one degree of freedom. If the null hypothesis is rejected, NBRM is preferred to PRM.
Zero-inflated models handle overdispersion by changing the mean structure to explicitly model the production of zero counts (Long 1997). These models assume two latent groups. One is the always-zero group and the other is not-always-zero or sometime-zero group (Long 1997). Thus, zero counts come from the former group and some of the latter group with a certain probability.
The likelihood ratio tests the null hypothesis of alpha=0 to compare the ZIP and NBRM. The PRM and ZIP, and NBRM and ZINB cannot, however, be tested by this likelihood ratio, since they are not nested respectively. The Voung’s statistic compares these non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96, the PRM or NBRM is preferred (Long 1997).
1.4 Estimation in SAS, STATA, and LIMDEP
The SAS GENMOD estimates Poisson and negative binomial regression models. STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data models. LIMDEP has Poisson$ and Negbin$ commands to estimate various count data models including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and commands for count data regression models.
Table 2. Comparison of the Procedures and Commands for Count Data Models
| Model |
SAS 9.1 |
STATA 9.0 SE |
LIMDEP 8.0 |
| Poisson Regression (PRM) |
GENMOD |
.poission |
Poisson$ |
| Negative Binomial Regression (NBRM) |
GENMOD |
.nbreg |
Negbin$ |
| Zero-infliated Poisson (ZIP) |
- |
.zip |
Poisson; Zip; Rh2$ |
| Zero-Inflacted Negative Binomial (ZINB) |
- |
.zinb |
Negbin; Zip; Rh2$ |
| Zero-truncated Poisson (ZTP) |
- |
.zip |
Poisson; Truncation$ |
| Zero-truncated Negative Binomial (ZTNB) |
- |
.zinb |
Negbin; Truncation$ |
The example here examines how waste quotas (emps) and the strictness of policy implementation (strict) affect the frequency of waste spill accidents of plants (accident).
1.5 Long and Freese's SPost Module
STATA users may take advantages of user-written modules such as SPost written by J. Scott Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of various CDVMs including event count data models. See 2.3 for examples of major SPost commands.
In order to install SPost, execute the following commands consecutively. For more details, visit J. Scott Long’s Web site at
http://www.indiana.edu/~jslsoc/spost_install.htm.
. net from http://www.indiana.edu/~jslsoc/stata/
. net install spostado, replace/
. net get spostrm7/
Top
The SAS GENMOD procedure, STATA .poisson command, and LIMDEP Poisson$ command estimate the Poisson regression model (PRM).
2.1 PRM in SAS
SAS has the GENMOD procedure for the PRM. The /DIST=POISSON option tells SAS to use the Poisson distribution.
PROC GENMOD DATA = masil.accident;
MODEL accident=emps strict /DIST=POISSON LINK=LOG;
RUN;
The GENMOD Procedure
Model Information
Data Set
COUNT.WASTE
Distribution
Poisson
Link Function
Log
Dependent Variable Accident
Observations
Used
778
Criteria For Assessing Goodness Of Fit
Criterion
DF Value
Value/DF
Deviance
775 2827.2079
3.6480
Scaled Deviance 775
2827.2079 3.6480
Pearson Chi-Square 775
4944.9473 6.3806
Scaled Pearson X2 775
4944.9473 6.3806
Log Likelihood
-667.2291
Algorithm converged.
Analysis Of Parameter Estimates
Standard Wald 95% Confidence
Chi-
Parameter DF Estimate
Error Limits
Square Pr > ChiSq
Intercept 1 0.3901
0.0467 0.2986 0.4816
69.84 <.0001
Emps 1
0.0054 0.0007 0.0040
0.0069 53.13 <.0001
Strict 1 -0.7042
0.0668 -0.8350 -0.5733
111.25 <.0001
Scale 0
1.0000 0.0000 1.0000
1.0000
NOTE:
The scale parameter was held fixed.
You will need to run a restricted model without regressors in order to conduct the likelihood ratio test for goodness-of-fit. The chi-squared statistic is 124.8218 = 2* [-667.2291 - (-729.6400)] (p<.0000).
PROC GENMOD DATA = masil.accident;
MODEL accident= /DIST=POISSON LINK=LOG;
RUN;
The GENMOD Procedure
Model Information
Data
Set
MASIL.ACCIDENT
Distribution
Poisson
Link Function
Log
Dependent Variable accident
Number
of Observations Read 778
Number of Observations Used
778
Criteria For Assessing Goodness Of Fit
Criterion
DF Value
Value/DF
Deviance
777 2952.0297
3.7993
Scaled Deviance 777
2952.0297 3.7993
Pearson Chi-Square 777
4919.9745 6.3320
Scaled Pearson X2 777
4919.9745 6.3320
Log Likelihood
-729.6400
Algorithm converged.
Analysis Of Parameter Estimates
Standard
Wald 95% Confidence Chi-
Parameter DF Estimate
Error Limits
Square Pr > ChiSq
Intercept 1 0.3168
0.0306 0.2568 0.3768
107.20 <.0001
Scale 0
1.0000 0.0000 1.0000
1.0000
NOTE:
The scale parameter was held fixed.
Top
2.2 PRM in STATA
STATA has the .poisson command for the PRM. This command provides likelihood ratio and Pseudo R2 statistics.
. poisson accident emps strict
Iteration
0: log likelihood = -1821.5112
Iteration
1: log likelihood = -1821.5101
Iteration
2: log likelihood = -1821.5101
Poisson
regression
Number of obs = 778
LR
chi2(2) = 124.82
Prob > chi2 = 0.0000
Log
likelihood = -1821.5101
Pseudo R2 = 0.0331
------------------------------------------------------------------------------
accident | Coef. Std. Err.
z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
emps | .0054186 .0007434 7.29
0.000 .0039615 .0068757
strict | -.7041664 .0667619 -10.55
0.000 -.8350174 -.5733154
_cons | .3900961 .0466787 8.36
0.000 .2986076 .4815846
------------------------------------------------------------------------------
Let us run a restricted model and then run the .display command in order to double check that the likelihood ratio for goodness-of-fit is 124.8218.
. poisson accident
Iteration
0: log likelihood = -1883.921
Iteration
1: log likelihood = -1883.921
Poisson
regression
Number of obs = 778
LR chi2(0) =
0.00
Prob > chi2 =
.
Log
likelihood = -1883.921
Pseudo R2 = 0.0000
------------------------------------------------------------------------------
accident | Coef. Std. Err.
z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | .3168165 .0305995 10.35
0.000 .2568426 .3767904
------------------------------------------------------------------------------
. display 2 * (-1821.5101 - (-1883.921))
124.8218
Top
2.3 Using the SPost Module in STATA
The SPost module provides useful commands for follow-up analyses of various categorical dependent variable models. The .fitstat command calculates various goodness-of-fit statistics such as log likelihood, McFadden’s R2 (or Pseudo R2), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC).
. quietly poisson accident emps strict
. fitstat
Measures
of Fit for poisson of accident
Log-Lik
Intercept Only: -1883.921 Log-Lik
Full Model: -1821.510
D(775):
3643.020 LR(2):
124.822
Prob > LR:
0.000
McFadden's
R2:
0.033 McFadden's Adj R2:
0.032
Maximum
Likelihood R2: 0.148
Cragg & Uhler's R2:
0.149
AIC:
4.690 AIC*n:
3649.020
BIC:
-1515.943 BIC':
-111.508
The .listcoef command lists unstandardized coefficients (parameter estimates), factor and percent changes, and standardized coefficients to help interpret regression results.
. listcoef, help
poisson
(N=778): Factor Change in Expected Count
Observed
SD: 2.9482675
----------------------------------------------------------------------
accident | b
z P>|z| e^b e^bStdX
SDofX
-------------+--------------------------------------------------------
emps | 0.00542 7.289 0.000
1.0054 1.2297 38.1548
strict | -0.70417 -10.547 0.000 0.4945
0.7031 0.5003
----------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
e^b = exp(b) = factor change in expected count for unit increase in X
e^bStdX
= exp(b*SD of X) = change in expected count for SD increase in X
SDofX = standard deviation of X
The .prtab command constructs a table of predicted values (events) for all combinations of categorical variables listed. The following example shows that the predicted number of accidents under the strict policy is .9172 at the mean waste quota (emps=42.0129).
. prtab strict
poisson:
Predicted rates for accident
----------------------
strict | Prediction
----------+-----------
0 | 1.8547
1 | 0.9172
----------------------
emps strict
x=
42.012853 .50771208
The .prvalue lists predicted values for a given set of values for the independent variables. For example, the predicted probability of a zero count is .3996 at the mean waste quota under the strict policy (strict=1). Note that the predicted rate of .917 is equivalent to .9172 in the .prtab above.
. prvalue, x(strict=1) maxcnt(5)
poisson:
Predictions for accident
Predicted
rate: .917 95% CI [.827 , 1.02]
Predicted
probabilities:
Pr(y=0|x): 0.3996 Pr(y=1|x): 0.3665
Pr(y=2|x): 0.1681 Pr(y=3|x): 0.0514
Pr(y=4|x): 0.0118 Pr(y=5|x): 0.0022
emps strict
x=
42.012853 1
The most useful command is the .prchange that calculates marginal effects (changes) and discrete changes. For instance, a standard deviation increase in waste quota form its mean will increase accidents by .3841 under the lenient policy (strict=0).
. prchange, x(strict=0)
poisson:
Changes in Predicted Rate for accident
min->max 0->1 -+1/2
-+sd/2 MargEfct
emps 2.3070 0.0080 0.0101
0.3841 0.0101
strict
-0.9375 -0.9375 -1.3332 -0.6568
-1.3060
exp(xb):
1.8547
emps strict
x= 42.0129 0
sd(x)=
38.1548 .500262
SPost also includes the .prgen command, which computes a series of predictions by holding all variables but one constant and allowing that variable to vary (Long and Freese 2003). These SPost commands work with most categorical and count data models such as .logit, .probit, .poisson, .nbreg, .zip, and .zinb.
Top
2.4 PRM in LIMDEP
The LIMDEP Poisson$ command estimates the PRM. LIMDEP reports log likelihoods of both the unrestricted and restricted models. Keep in mind that you must include the ONE for the intercept.
POISSON;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT$
+---------------------------------------------+
|
Poisson Regression
|
|
Maximum Likelihood Estimates
|
|
Model estimated: Aug 24, 2005 at 04:56:45PM.|
|
Dependent variable
ACCIDENT |
|
Weighting variable
None |
|
Number of observations
778 |
|
Iterations completed
8 |
|
Log likelihood function -1821.510
|
|
Restricted log likelihood -1883.921
|
|
Chi squared
124.8218 |
|
Degrees of freedom
2 |
|
Prob[ChiSqd > value] = .0000000
|
|
Chi- squared = 4944.94781 RsqP= -.0051 |
|
G - squared = 2827.20794 RsqD= .0423
|
|
Overdispersion tests: g=mu(i) : 4.720 |
|
Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.3900961420 .46678663E-01 8.357 .0000
EMPS
.5418599057E-02 .74341923E-03 7.289 .0000
42.012853
STRICT
-.7041663804 .66761926E-01 -10.547 .0000
.50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors (Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.291 versus -1821.5101). This difference seems to come from the generalized linear model that the GENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that they result in the same likelihood ratio.
Top
The SAS GENMODE procedure, STATA .nbreg command, and LIMDEP Negbin$ command estimate the negative binomial regression model (NBRM).
3.1 NBRM in SAS
The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that the dispersion parameter is equivalent to the alpha in STATA and LIMDEP.
PROC GENMOD DATA = masil.accident;
MODEL accident=emps strict /DIST=NEGBIN LINK=LOG;
RUN;
The GENMOD Procedure
Model Information
Data Set COUNT.WASTE
Distribution Negative
Binomial
Link Function
Log
Dependent Variable
Accident
Observations Used
778
Criteria For Assessing Goodness Of Fit
Criterion
DF Value
Value/DF
Deviance
775 589.7752 0.7610
Scaled Deviance 775
589.7752 0.7610
Pearson Chi-Square 775
845.6033 1.0911
Scaled Pearson X2 775
845.6033 1.0911
Log
Likelihood
37.5628
Algorithm converged.
Analysis Of Parameter Estimates
Standard Wald 95% Confidence
Chi-
Parameter DF Estimate
Error Limits
Square Pr > ChiSq
Intercept 1 0.3851
0.1278 0.1345 0.6357
9.07 0.0026
Emps 1
0.0052 0.0023 0.0008
0.0096 5.29
0.0214
Strict 1
-0.6703 0.1671 -0.9978
-0.3427 16.09
<.0001
Dispersion 1 3.9554
0.3501 3.3254 4.7048
NOTE:
The negative binomial dispersion parameter was estimated by maximum likelihood.
The restricted model produces a log likelihood of 28.8627. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * (37.5628 – 28.8627) (p<.00017).
PROC GENMOD DATA = masil.accident;
MODEL accident= /DIST=NEGBIN LINK=LOG;
RUN;
The likelihood ratio for overdispersion is 1409.5838 = 2 * (37.5628 - (-667.2291)).
Top
3.2 NBRM in STATA
STATA has the .nbreg command for the NBRM. The command reports three log likelihood statistics: for the PRM, restricted NBRM (constant-only model), and unrestricted NBRM (full model), which make it easy to conduct likelihood ratio tests.
. nbreg accident emps strict
Fitting
comparison Poisson model:
Iteration
0: log likelihood = -1821.5112
Iteration
1: log likelihood = -1821.5101
Iteration
2: log likelihood = -1821.5101
Fitting
constant-only model:
Iteration
0: log likelihood = -1256.6761
Iteration
1: log likelihood = -1152.6155
Iteration
2: log likelihood = -1125.6643
Iteration
3: log likelihood = -1125.4183
Iteration
4: log likelihood = -1125.4183
Fitting
full model:
Iteration
0: log likelihood = -1117.1731
Iteration
1: log likelihood = -1116.7201
Iteration
2: log likelihood = -1116.7182
Iteration
3: log likelihood = -1116.7182
Negative
binomial regression
Number of
obs = 778
LR chi2(2) = 17.40
Prob > chi2 = 0.0002
Log
likelihood = -1116.7182
Pseudo R2 = 0.0077
------------------------------------------------------------------------------
accident | Coef. Std. Err.
z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
emps | .0051981 .0022595 2.30
0.021 .0007694 .0096267
strict | -.6702548 .1671191 -4.01
0.000 -.9978021 -.3427074
_cons | .3851111 .1278468 3.01
0.003 .134536 .6356861
-------------+----------------------------------------------------------------
/lnalpha | 1.37509 .0885176
1.201599 1.548582
-------------+----------------------------------------------------------------
alpha | 3.955434 .3501257
3.32543 4.704793
------------------------------------------------------------------------------
Likelihood
ratio test of alpha=0: chibar2(01) = 1409.58 Prob>=chibar2 = 0.000
The restricted model or “constant-only model” gives us a log likelihood -1125.4183. Thus, the likelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p<.00017). The p-value is computed as follows (Note the .disp or .di is an abbreviation of the .display).
. disp chi2tail(2, 17.4002)
.00016657
The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p<.0000) and rejects the null hypothesis of alpha=0. The statistically significant evidence of overdispersion indicates that the NBRM is preferred to the PRM.
. di 2 * (-1116.7182 - (-1821.5101))
1409.5838
The p-value of the likelihood ratio for overdispersion is computed as,
. di chi2tail(1, 1409.5838)
1.74e-308
Now, let us calculate marginal effects (or changes) at the means of independent variables. You should the read the discrete change labeled “0->1” of a binary variable strict, since its marginal change at the mean (.5077) is meaningless.
. prchange
nbreg:
Changes in Predicted Rate for accident
min->max 0->1 -+1/2
-+sd/2 MargEfct
emps 1.5326 0.0055 0.0068
0.2585 0.0068
strict
-0.8931 -0.8931 -0.8885 -0.4383
-0.8721
exp(xb):
1.3011
emps strict
x= 42.0129 .507712
sd(x)=
38.1548 .500262
Top
3.3 NBRM in LIMDEP
LIMDEP has the Negbin$ command for the NBRM that reports the PRM as well. Note that the standard errors of parameter estimates are slightly different from those of SAS and STATA. The Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean of independent variables. You may not omit the Means$ subcommand.
NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
Marginal Effects;
Means$
+---------------------------------------------+
|
Poisson Regression
|
|
Maximum Likelihood Estimates
|
|
Model estimated: Sep 08, 2005 at 09:35:36AM.|
|
Dependent variable
ACCIDENT |
|
Weighting variable
None |
|
Number of observations
778 |
|
Iterations completed
8 |
|
Log likelihood function -1821.510
|
|
Restricted log likelihood -1883.921
|
|
Chi squared
124.8218 |
|
Degrees of freedom
2 |
|
Prob[ChiSqd > value] = .0000000
|
|
Chi- squared = 4944.94781 RsqP= -.0051 |
|
G - squared = 2827.20794 RsqD= .0423
|
|
Overdispersion tests: g=mu(i) : 4.720 |
|
Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.3900961420 .46678663E-01 8.357 .0000
EMPS
.5418599057E-02 .74341923E-03 7.289 .0000
42.012853
STRICT
-.7041663804 .66761926E-01 -10.547 .0000
.50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal
exit from iterations. Exit status=0.
+---------------------------------------------+
|
Negative Binomial Regression
|
|
Maximum Likelihood Estimates
|
|
Model estimated: Sep 08, 2005 at 09:35:36AM.|
|
Dependent variable
ACCIDENT |
|
Weighting variable
None |
|
Number of observations
778 |
|
Iterations completed
8 |
|
Log likelihood function -1116.718
|
|
Restricted log likelihood -1821.510
|
|
Chi squared
1409.584 |
|
Degrees of freedom
1 |
|
Prob[ChiSqd > value] = .0000000
|
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.3851110699 .12855240
2.996 .0027
EMPS
.5198057234E-02 .22602075E-02 2.300 .0215
42.012853
STRICT
-.6702547660 .16729839 -4.006
.0001 .50771208
Dispersion parameter for count data model
Alpha
3.955434012 .35680876 11.086
.0000
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
+-------------------------------------------+
|
Partial derivatives of expected val. with |
|
respect to the vector of characteristics. |
|
They are computed at the means of the Xs. |
|
Observations used for means are All Obs. |
|
Conditional Mean at Sample Point 1.3011 |
|
Scale Factor for Marginal Effects 1.3011 |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.5010628939 .19396434
2.583 .0098
EMPS
.6763123170E-02 .29746591E-02 2.274 .0230
42.012853
STRICT
-.8720595665 .22469308 -3.881
.0001 .50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
Read the coefficients (.0068 and -.8721) to confirm that they are identical to the corresponding marginal effects calculated in STATA.
SAS, STATA, and LIMDEP produce almost the same parameter estimates and goodness-of-fit statistics (Table 4). Note that SAS reports different log likelihoods, but the same likelihood ratio.
Figure 3 compares the PRM and NBRM. Look at the predictions for zero counts of the two models. As the likelihood ratio test indicates, the NBRM seems to fit these data better than PRM.
Figure 3. Comparison of the Poisson and Negative Binomial Regression Models
Top
STATA and LIMDEP have commands for the zero-inflated Poisson regression model (ZIP).
4.1 ZIP in STATA
STATA has the .zip command to estimate the ZIP. The inflate() option specifies a list of variables that determines whether the observed count is zero. The vuong option computes the Vuong statistic to compare the ZIP and PRM.
. zip accident emps strict, inflate(emps strict) vuong
Fitting
constant-only model:
Iteration
0: log likelihood = -1627.0779
Iteration
1: log likelihood = -1309.5825
Iteration
2: log likelihood = -1272.433
Iteration
3: log likelihood = -1270.9543
Iteration
4: log likelihood = -1270.9523
Iteration
5: log likelihood = -1270.9523
Fitting
full model:
Iteration
0: log likelihood = -1270.9523
Iteration
1: log likelihood = -1269.7219
Iteration
2: log likelihood = -1269.7206
Iteration
3: log likelihood = -1269.7206
Zero-inflated
Poisson regression
Number of obs = 778
Nonzero obs =
280
Zero
obs =
498
Inflation
model = logit
LR chi2(2) =
2.46
Log
likelihood = -1269.721
Prob > chi2 = 0.2918
------------------------------------------------------------------------------
| Coef. Std. Err.
z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident
|
emps | -.000277 .0008633 -0.32
0.748 -.001969 .001415
strict | -.0923911 .0729023 -1.27
0.205 -.2352771 .0504948
_cons | 1.361978 .0493222 27.61
0.000 1.265308 1.458647
-------------+----------------------------------------------------------------
inflate
|
emps | -.0109897 .0022678 -4.85
0.000 -.0154344 -.006545
strict | 1.057031 .1767509 5.98
0.000 .7106059 1.403457
_cons | .488656 .1211099
4.03 0.000 .2512849
.726027
------------------------------------------------------------------------------
Vuong
test of zip vs. standard Poisson:
z = 8.40 Pr>z = 0.0000
The restricted model is estimated with the intercept only.
. zip accident, inflate(emps strict)
The Vuong statistic at the bottom compares the ZIP and PRM. Since the V 8.40 is greater than 1.96, we conclude that the ZIP is preferred to the PRM.
Top
4.2 ZIP in LIMDEP
The LIMDEP Poisson$ command needs to have the Zip and Rh2 subcommands. The Rh2 is equivalent to the inflate() option in STATA. The Alg=Newton$ subcommand is needed to use the Newton-Raphson algorithm because the default Broyden algorithm failed to converge.
POISSON;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$
+---------------------------------------------+
|
Poisson Regression
|
|
Maximum Likelihood Estimates
|
|
Model estimated: Sep 06, 2005 at 00:25:07PM.|
|
Dependent variable
ACCIDENT |
|
Weighting variable
None |
|
Number of observations
778 |
|
Iterations completed
8 |
|
Log likelihood function -1821.510
|
|
Restricted log likelihood -1883.921
|
|
Chi squared
124.8218 |
|
Degrees of freedom
2 |
|
Prob[ChiSqd > value] = .0000000
|
|
Chi- squared = 4944.94781 RsqP= -.0051 |
|
G - squared = 2827.20794 RsqD= .0423
|
|
Overdispersion tests: g=mu(i) : 4.720 |
|
Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.3900961420 .46678663E-01 8.357 .0000
EMPS
.5418599057E-02 .74341923E-03 7.289 .0000
42.012853
STRICT
-.7041663804 .66761926E-01 -10.547 .0000
.50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal
exit from iterations. Exit status=0.
+----------------------------------------------------------------------+
|
Zero Altered Poisson Regression Model
|
|
Logistic distribution used for splitting model.
|
|
ZAP term in probability is F[tau x Z(i) ]
|
|
Comparison of estimated models
|
|
Pr[0|means] Number of zeros
Log-likelihood |
|
Poisson .27329
Act.= 498 Prd.= 212.6 -1821.51007
|
|
Z.I.Poisson .64642 Act.=
498 Prd.= 502.9 -1259.88568 |
|
Note, the ZIP log-likelihood is not directly comparable.
|
|
ZIP model with nonzero Q does not encompass the others.
|
|
Vuong statistic for testing ZIP vs. unaltered model is
9.5740 |
|
Distributed as standard normal. A value greater than
|
|
+1.96 favors the zero altered Z.I.Poisson model.
|
|
A value less than -1.96 rejects the ZIP model.
|
+----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Poisson/NB/Gamma regression model
Constant
1.361977491 .23944641E-01 56.880 .0000
EMPS
-.2770010575E-03 .37770090E-03 -.733 .4633
42.012853
STRICT
-.9239125073E-01 .33326502E-01 -2.772 .0056
.50771208
Zero inflation model
Constant
.4886559537 .12210013
4.002 .0001
EMPS
-.1098971050E-01 .22152492E-02 -4.961 .0000
42.012853
STRICT
1.057031399 .17715551
5.967 .0000 .50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
In order to estimate the restricted model, run the following command with the ONE only in the Lhs$ subcommand. The Rh2$ subcommand remains unchanged.
POISSON;
Lhs=ACCIDENT;
Rhs=ONE;
Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$
STATA and LIMDEP report the same parameter estimates, but they produce different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious log likelihood for the restricted model, and thus ended up with the “unlikely” likelihood ratio of -.0304. In addition, the Vuong statistics in STATA and LIMDEP are different.
Top
STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB).
5.1 ZINB in STATA
The STATA .zinb command estimates the ZINB. The vuong option computes the Vuong statistic to compare the ZINB and NBRM.
. zinb accident emps strict, inflate(emps strict) vuong
Fitting
constant-only model:
Iteration
0: log likelihood = -1190.5117 (not concave)
Iteration
1: log likelihood = -1106.9874
Iteration
2: log likelihood = -1098.8642
Iteration
3: log likelihood = -1095.3638
Iteration
4: log likelihood = -1094.0237
Iteration
5: log likelihood = -1093.063
Iteration
6: log likelihood = -1092.6216
Iteration
7: log likelihood = -1091.798
Iteration
8: log likelihood = -1091.7332
Iteration
9: log likelihood = -1091.7329
Iteration
10: log likelihood = -1091.7329
Fitting
full model:
Iteration
0: log likelihood = -1091.7329
Iteration
1: log likelihood = -1089.5565
Iteration
2: log likelihood = -1089.5198
Iteration
3: log likelihood = -1089.5198
Zero-inflated
negative binomial regression Number
of obs = 778
Nonzero
obs = 280
Zero obs =
498
Inflation
model = logit
LR chi2(2) =
4.43
Log
likelihood = -1089.52
Prob > chi2 = 0.1094
------------------------------------------------------------------------------
| Coef. Std. Err.
z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident
|
emps | -.0004407 .0020554 -0.21
0.830 -.0044691 .0035877
strict | -.3251317 .1659173 -1.96
0.050 -.6503235 .0000602
_cons | .7763065 .1508037 5.15
0.000 .4807367 1.071876
-------------+----------------------------------------------------------------
inflate
|
emps | -.2087768 .0955122 -2.19
0.029 -.3959772 -.0215763
strict | 7.562388 3.055775 2.47
0.013 1.573179 13.5516
_cons | .1032115 .3800045 0.27
0.786 -.6415835 .8480065
-------------+----------------------------------------------------------------
/lnalpha | .9252514 .1351387
6.85 0.000 .6603845 1.190118
-------------+----------------------------------------------------------------
alpha | 2.522502 .3408876
1.935536 3.28747
------------------------------------------------------------------------------
Vuong
test of zinb vs. standard negative binomial: z = 4.13
Pr>z = 0.0000
The likelihood ratio, 360.4024= 2*(-1089.5198 - (-1269.721)), rejects the null hypothesis of no overdispersion, indicating that the ZINB can improve goodness-of-fit over the ZIP (p<.0000). The Vuong test, 4.13 > 1.96, suggests that the ZINB is preferred to the NBRM.
Top
5.2 ZINB in LIMDEP
The LIMDEP Negbin$ command needs to have the Zip and Rh2 subcommands for the ZINB. The following command produces the Poisson regression model, negative binomial model, and zero-inflated negative binomial model. You may omit the Alg=Newton$ subcommand.
NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$
+---------------------------------------------+
|
Poisson Regression |
|
Maximum Likelihood Estimates
|
|
Model estimated: Sep 10, 2005 at 00:20:00AM.|
|
Dependent variable
ACCIDENT |
|
Weighting variable
None |
|
Number of observations
778 |
|
Iterations completed
8 |
|
Log likelihood function -1821.510
|
|
Restricted log likelihood -1883.921
|
|
Chi squared
124.8218 |
|
Degrees of freedom
2 |
|
Prob[ChiSqd > value] = .0000000
|
|
Chi- squared = 4944.94781 RsqP= -.0051 |
|
G - squared = 2827.20794 RsqD= .0423
|
|
Overdispersion tests: g=mu(i) : 4.720 |
|
Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.3900961420 .46678663E-01 8.357 .0000
EMPS
.5418599057E-02 .74341923E-03 7.289 .0000
42.012853
STRICT
-.7041663804 .66761926E-01 -10.547 .0000
.50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal
exit from iterations. Exit status=0.
+---------------------------------------------+
|
Negative Binomial Regression
|
|
Maximum Likelihood Estimates
|
|
Model estimated: Sep 10, 2005 at 00:20:00AM.|
|
Dependent variable ACCIDENT
|
|
Weighting variable
None |
|
Number of observations
778 |
|
Iterations completed
12 |
|
Log likelihood function -1116.718
|
|
Restricted log likelihood -1821.510
|
|
Chi squared
1409.584 |
|
Degrees of freedom
1 |
|
Prob[ChiSqd > value] = .0000000
|
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant
.3851110482 .12855240
2.996 .0027
EMPS
.5198057322E-02 .22602075E-02 2.300 .0215
42.012853
STRICT
-.6702547787 .16729839 -4.006
.0001 .50771208
Dispersion parameter for count data model
Alpha
3.955434128 .35680877 11.086
.0000
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
Normal
exit from iterations. Exit status=0.
+----------------------------------------------------------------------+
|
Zero Altered Neg.Binomial Regression Model
|
|
Logistic distribution used for splitting model.
|
|
ZAP term in probability is F[tau x Z(i) ]
|
|
Comparison of estimated models
|
|
Pr[0|means] Number of zeros
Log-likelihood |
|
Poisson .27329
Act.= 498 Prd.= 212.6 -1821.51007
|
|
Neg. Bin. .32470 Act.=
498 Prd.= 252.6 -1116.71820 |
|
Z.I.Neg_Bin .62918 Act.=
498 Prd.= 489.5 -1089.51977 |
|
Note, the ZIP log-likelihood is not directly comparable.
|
|
ZIP model with nonzero Q does not encompass the others.
|
|
Vuong statistic for testing ZIP vs. unaltered model is
4.1270 |
|
Distributed as standard normal. A value greater than
|
|
+1.96 favors the zero altered Z.I.Neg_Bin model.
|
|
A value less than -1.96 rejects the ZIP model.
|
+----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable
| Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Poisson/NB/Gamma regression model
Constant
.7763063017 .15178042
5.115 .0000
EMPS
-.4407244013E-03 .20262626E-02 -.218 .8278
42.012853
STRICT
-.3251315411 .16179883 -2.009
.0445 .50771208
Dispersion parameter
Alpha
2.522502810 .29924002
8.430 .0000
Zero inflation model
Constant
.1032103951 .37413759
.276 .7827
EMPS
-.2087767804 .68774937E-01 -3.036 .0024
42.012853
STRICT
7.562389399 2.2216392
3.404 .0007 .50771208
(Note:
E+nn or E-nn means multiply by 10 to + or -nn power.)
In order to estimate the restricted model, run the following command. You have to use the Alg=Newton$ subcommand to get the restricted model to converge.
NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE;
Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$
Figure 4. Comparison of the Zero-Inflated PRM and the Zero-Inflated NBRM
Top
Like other econometric models, researchers must first examine the data generation process of a dependent variable to understand its behavior. Sophisticated researchers pay special attention to excess zeros, censored and/or truncated counts, sample selection, and other particular patterns of the data generation, and then decide which model best describes the data generation process.
The Poisson regression model and negative binomial regression model have the same mean structure, but they describe the behavior of a dependent variable in different ways. Zero-inflated regression models integrate two different data generation processes to deal with overdispersion. Truncated or censored regression models are appropriate when data are (left and/or right) truncated or censored.
Researchers need to spend more time and effort interpreting the results substantively. Like other categorical dependent variable models, count data models produce estimates that are difficult to interpret intuitively. Reporting parameter estimates and goodness-of-fit statistics are not sufficient. J. Scott Long (1997) and Long and Freese (2003) provide good examples of meaningful count data model interpretations.
Regarding statistical software, I would recommend STATA for general count data models and LIMDEP for special types of models. Although able to handle various models, LIMDEP does not seem stable and reliable. The SAS GENMODE procedure estimates the Poisson regression model and the negative binomial model, but it does not have easy ways of estimating other models. We encourage SAS Institute to develop an individual procedure, say the CLIM (Count and Limited Dependent Variable Model) procedure, to handle a variety of count data models.
Top
The data set used here is a part of the data provided for David H. Good’s class in the School of Public and Environmental Affairs, Indiana University. Note that these data have been manipulated for the sake of data security. The variables in the data set include,
- emps: the size of the waste quotas
- strict: strictness of policy implementation (1=strict)
- accident: the frequency of waste spill accidents of plant
Download (UITS Stat/Math Center):
ASCII Text |
STATA |
SAS
The followings summarize descriptive statistics of these variables. Note that there are many zero counts that indicate an overdispersion problem.
. summarize accident emps strict
Variable | Obs
Mean Std. Dev. Min
Max
-------------+--------------------------------------------------------
accident | 778 1.372751
2.948267 0
31
emps | 778 42.01285
38.1548 1
174
strict | 778 .5077121
.5002621 0
1
. tab accident strict
| strict
accident | 0
1 | Total
-----------+----------------------+----------
0 | 214
284 | 498
1 | 41
29 | 70
2 | 38
32 | 70
3 | 28
13 | 41
4 | 16
13 | 29
5 | 10
3 | 13
6 | 12
7 | 19
7 | 4
3 | 7
8 | 4
2 | 6
9 | 3
2 | 5
10 | 0
2 | 2
11 | 3
1 | 4
12 | 2
0 | 2
13 | 1
0 | 1
14 | 1
0 | 1
15 | 3
0 | 3
16 | 1
0 | 1
17 | 0
2 | 2
18 | 1
1 | 2
21 | 0
1 | 1
31 | 1
0 | 1
-----------+----------------------+----------
Total | 383
395 | 778
Top
- Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute.
- Cameron, A. Colin, and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.
- Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
- Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed. Plainview, New York: Econometric Software.
- Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using STATA, 2nd ed. College Station, TX: STATA Press.
- Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Sage Publications.
- Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
- SAS Institute. 2004. SAS/STAT 9.1 User's Guide, Version 8. Cary, NC: SAS Institute.
- STATA Press. 2005. STATA Base Reference Manual, Release 9. College Station, TX: STATA Press.
I am grateful to Jeremy Albright and Kevin Wilhite at the Center for Statistical and Mathematical Computing, Indiana University, who provided valuable comments and suggestions.
- 2003. First draft.
- 2004. Second draft.
- 2005. Third draft (Added LIMDEP examples).