### 2. Least Squares Dummy Variable Regression

A dummy variable is a binary variable that is coded either 1 or 0. It is commonly used to examine group and time effects in regression. Consider a simple model that regresses R&D expenditure in 2002 on net income in 2000 and firm type. The dummy variable d1 is set to 1 for equipment and software firms and 0 for telecommunication and electronics firms; d2 is coded in the opposite way. Take a look at the data structure (Figure 2).

Figure 2. Dummy Variable Coding for Firm Type

+-----------------------------------------------------------------+
|             firm      rnd    income              type   d1   d2 |
|-----------------------------------------------------------------|
|          Samsung    2,500     4,768       Electronics    0    1 |
|             AT&T      254     4,669           Telecom    0    1 |
|              IBM    4,750     8,093      IT Equipment    1    0 |
|          Siemens    5,490     6,528       Electronics    0    1 |
|          Verizon        .    11,797           Telecom    0    1 |
|        Microsoft    3,772     9,421     Service & S/W    1    0 |
...            ...      ...       ...               ...   ...   ...
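The coding scheme in Figure 2 can be sketched in plain Python (a hypothetical illustration; the firm names and types follow Figure 2, and `equip_sw` is an assumed helper set, not part of the original data file):

```python
# Sketch of the dummy coding in Figure 2: d1 = 1 for equipment and
# software firms, d2 = 1 for telecom and electronics firms.
firms = [
    ("Samsung",   "Electronics"),
    ("AT&T",      "Telecom"),
    ("IBM",       "IT Equipment"),
    ("Siemens",   "Electronics"),
    ("Verizon",   "Telecom"),
    ("Microsoft", "Service & S/W"),
]

equip_sw = {"IT Equipment", "Service & S/W"}  # assumed grouping of types

rows = []
for name, ftype in firms:
    d1 = 1 if ftype in equip_sw else 0
    d2 = 1 - d1  # the two dummies are mutually exclusive and exhaustive
    rows.append((name, ftype, d1, d2))

for row in rows:
    print(row)
```

Because d1 + d2 = 1 for every firm, including both dummies alongside an intercept would create perfect multicollinearity, which is why one dummy must be handled specially (Sections 2.2 and 2.4).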

2.1 Model 1 without a Dummy Variable

The ordinary least squares (OLS) regression without dummy variables, a pooled regression model, assumes a constant intercept and slope regardless of firm types. In the following regression equation, beta0 is the intercept; beta1 is the slope of net income in 2000; and e is the error term.

Model 1: R&D = beta0 + beta1*income + e

The pooled model has an intercept of 1,482.697 and a slope of .223: for a \$1 million increase in net income, a firm is expected to increase its 2002 R&D expenditure by \$.223 million.

. regress rnd income

Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  1,    37) =    7.07
Model |  15902406.5     1  15902406.5           Prob > F      =  0.0115
Residual |  83261299.1    37  2250305.38           R-squared     =  0.1604
-------------+------------------------------           Adj R-squared =  0.1377
Total |  99163705.6    38   2609571.2           Root MSE      =  1500.1

------------------------------------------------------------------------------
rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
income |   .2230523   .0839066     2.66   0.012     .0530414    .3930632
_cons |   1482.697   314.7957     4.71   0.000     844.8599    2120.533
------------------------------------------------------------------------------
Pooled model: R&D = 1,482.697 + .223*income

Despite moderate goodness-of-fit statistics such as F and t, this is a naive model: R&D investment tends to vary across industries.
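The pooled fit above can be reproduced with any OLS routine. A minimal numpy sketch follows; the data are randomly generated for illustration and are NOT the article's 39-firm sample, so the estimates only approximate the true coefficients built into the simulation:

```python
import numpy as np

# Pooled OLS sketch: rnd = beta0 + beta1*income + e, ignoring firm type.
# Synthetic data, loosely scaled like the article's (income and R&D in $M).
rng = np.random.default_rng(0)
income = rng.uniform(1000, 12000, size=39)
rnd = 1500 + 0.22 * income + rng.normal(0, 1400, size=39)

# Design matrix: a column of ones (intercept) plus income.
X = np.column_stack([np.ones_like(income), income])
beta, *_ = np.linalg.lstsq(X, rnd, rcond=None)
print(f"intercept = {beta[0]:.1f}, slope = {beta[1]:.3f}")
```

The slope recovered here plays the role of the .223 in the Stata output: a common marginal effect of income imposed on all firms regardless of type.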


2.2 Model 2 with a Dummy Variable

You may suspect that equipment and software firms spend more on R&D than other types of companies. Let us take this group difference into account. We have to drop one of the two dummy variables in order to avoid perfect multicollinearity; OLS does not work with both dummies and an intercept in a model. The coefficient delta1 in Model 2 applies to equipment and software companies only.

Model 2: R&D = beta0 + beta1*income + delta1*d1 + e

Unlike Model 1, this model results in two different regression equations for two groups. The difference lies in the intercepts, but the slope remains unchanged.

. regress rnd income d1

Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
d1 |   1006.626   479.3717     2.10   0.043     34.41498    1978.837
_cons |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income = 1,133.579 + 1,006.626*1 + .218*income
d1=0: R&D = 1,133.579 + .218*income = 1,133.579 + 1,006.626*0 + .218*income

The slope of .218 indicates a positive impact of two-year-lagged net income on a firm’s R&D expenditure. Equipment and software firms on average spend \$1,007 million more on R&D than telecommunication and electronics companies.
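The dummy-variable fit can be sketched the same way. Again the data below are synthetic (generated around the article's estimates of 1,134, 1,007, and .218), so the printed values only approximate those targets:

```python
import numpy as np

# LSDV1 sketch: rnd = beta0 + beta1*income + delta1*d1 + e.
# Synthetic data; d1 = 1 for the first half of the firms.
rng = np.random.default_rng(1)
n = 40
income = rng.uniform(1000, 12000, size=n)
d1 = np.repeat([1, 0], n // 2)
rnd = 1134 + 1007 * d1 + 0.218 * income + rng.normal(0, 300, size=n)

X = np.column_stack([np.ones(n), income, d1])
b0, b1, delta1 = np.linalg.lstsq(X, rnd, rcond=None)[0]

# Two parallel regression lines: a common slope, with intercepts
# differing by delta1 (the fixed group effect).
print(f"d1=1 intercept: {b0 + delta1:.0f}")
print(f"d1=0 intercept: {b0:.0f}")
print(f"common slope:   {b1:.3f}")
```

This makes the structure of Model 2 explicit: the dummy shifts the intercept for equipment and software firms while leaving the income slope shared across groups.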


2.3 Visualization of Model 1 and 2

There is only a tiny difference in slope (.223 versus .218) between Model 1 and Model 2. The intercept of 1,483 in Model 1, however, is quite different from the 2,140 for equipment and software companies and the 1,134 for telecommunication and electronics firms in Model 2. This result appears to support Model 2.

Figure 3 highlights the differences between Models 1 and 2 more clearly. The black line (pooled) in the middle is the regression line of Model 1; the red line at the top is the one for equipment and software companies (d1=1) in Model 2; and the blue line at the bottom is the one for telecommunication and electronics firms (d2=1, or equivalently d1=0).

Figure 3. Regression Lines of Model 1 and Model 2

This plot shows that Model 1 ignores the group difference and thus reports a misleading intercept. The difference in intercepts between the two groups of firms looks substantial, while the two models have similar slopes. Consequently, Model 2, which considers fixed group effects, seems better than the simple Model 1. Compare the goodness-of-fit statistics (e.g., F, t, R2, and SSE) of the two models. See Sections 3.2.2 and 4.7 for formal hypothesis testing.


2.4 Alternatives to LSDV1

The least squares dummy variable (LSDV) regression is ordinary least squares (OLS) with dummy variables. The critical issue in LSDV is how to avoid perfect multicollinearity, the so-called “dummy variable trap.” LSDV has three approaches to avoid getting caught in the trap. They produce different parameter estimates for the dummies, but their results are equivalent.

The first approach, LSDV1, drops a dummy variable as in Model 2 above. The second approach includes all dummies and, in turn, suppresses the intercept (LSDV2). Finally, include the intercept and all dummies, and then impose a restriction that the sum of parameters of all dummies is zero (LSDV3). Take a look at the following functional forms to compare these three LSDVs.

LSDV1: R&D = beta0 + beta1*income + delta1*d1 + e, or R&D = beta0 + beta1*income + delta2*d2 + e
LSDV2: R&D = beta1*income + delta1*d1 + delta2*d2 + e
LSDV3: R&D = beta0 + beta1*income + delta1*d1 + delta2*d2 + e, subject to delta1 + delta2 = 0

The main differences among these approaches lie in the meanings of the dummy variable parameters; each approach defines the coefficients of the dummy variables in a different way (Table 3). The parameter estimates in LSDV2 are the actual intercepts of the groups, which makes them easy to interpret substantively. LSDV1 reports differences from the reference point (the dropped dummy variable). LSDV3 computes how far each parameter estimate is from the average group effect. Accordingly, the null hypotheses of the t-tests differ across the three approaches. Keep in mind that the R2 of LSDV2 is not correct. Table 3 contrasts the three LSDVs.
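The claimed equivalence of the three parameterizations can be checked numerically. The sketch below uses synthetic data (not the article's sample) and implements LSDV3 by substituting the restriction delta2 = -delta1 into the model, i.e., regressing on (d1 - d2), which is algebraically the same as a constrained fit:

```python
import numpy as np

# Compare LSDV1, LSDV2, and LSDV3 on synthetic two-group data.
rng = np.random.default_rng(2)
n = 40
income = rng.uniform(1000, 12000, size=n)
d1 = np.repeat([1, 0], n // 2)
d2 = 1 - d1
y = 2140 * d1 + 1134 * d2 + 0.218 * income + rng.normal(0, 300, size=n)

def ols(X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# LSDV1: intercept + income + d1 (d2 dropped).
b0, b1, g1 = ols(np.column_stack([np.ones(n), income, d1]))
# LSDV2: income + both dummies, no intercept.
b1_2, a1, a2 = ols(np.column_stack([income, d1, d2]))
# LSDV3: impose delta1 + delta2 = 0 by regressing on (d1 - d2).
c0, b1_3, dd = ols(np.column_stack([np.ones(n), income, d1 - d2]))

# All three imply the same group intercepts and the same slope:
print(b0 + g1, a1, c0 + dd)  # intercept for the d1 = 1 group
print(b0,      a2, c0 - dd)  # intercept for the d2 = 1 group
print(b1, b1_2, b1_3)        # common slope on income
```

The three fits span the same column space, so the fitted values and group intercepts agree exactly; only the reported dummy coefficients (difference from a baseline, actual intercept, or deviation from the average) differ, as summarized in Table 3.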

Table 3. Three Approaches of Least Squares Dummy Variable Models


2.5 Estimating Three LSDVs

The SAS REG procedure, Stata .regress command, LIMDEP Regress\$ command, and SPSS Regression command all fit OLS and LSDVs. Let us estimate the three LSDVs using SAS and Stata.

2.5.1 LSDV 1 without a Dummy

LSDV 1 drops one dummy variable. The intercept is the actual parameter estimate of the dropped dummy variable. The coefficient of the included dummy indicates how far its parameter estimate is from the reference point or baseline (i.e., the intercept).

Here we include d2 instead of d1 to see how a different reference point changes the result. Check the sign of the included dummy's coefficient and the intercept. Dropping a different dummy does not make any substantive difference.

PROC REG DATA=masil.rnd2002;
MODEL rnd = income d2;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: rnd

Number of Observations Read                         50
Number of Observations Used                         39
Number of Observations with Missing Values          11

Analysis of Variance

Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     2       24987949       12493974       6.06    0.0054
Error                    36       74175757        2060438
Corrected Total          38       99163706

Root MSE           1435.42248    R-Square     0.2520
Dependent Mean     2023.56410    Adj R-Sq     0.2104
Coeff Var            70.93536

Parameter Estimates

Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1     2140.20468      434.48460       4.93      <.0001
income        1        0.21801        0.08032       2.71      0.0101
d2            1    -1006.62593      479.37174      -2.10      0.0428
d2=0: R&D = 2,140.205 + .218*income = 2,140.205 - 1,006.626*0 + .218*income
d2=1: R&D = 1,133.579 + .218*income = 2,140.205 - 1,006.626*1 + .218*income

2.5.2 LSDV 2 without the Intercept

LSDV 2 includes all dummy variables and suppresses the intercept. The Stata .regress command has a noconstant option to fit LSDV2. The coefficients of the dummies are the actual parameter estimates, so you do not need to compute the group intercepts. This LSDV, however, reports an incorrect R2.

. regress rnd income d1 d2, noconstant

Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =   29.88
Model |   184685604     3  61561868.1           Prob > F      =  0.0000
Residual |  74175756.7    36  2060437.69           R-squared     =  0.7135
-------------+------------------------------           Adj R-squared =  0.6896
Total |   258861361    39  6637470.79           Root MSE      =  1435.4

------------------------------------------------------------------------------
rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
d1 |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
d2 |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income
d2=1: R&D = 1,133.579 + .218*income
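The inflated R2 under the no-intercept fit (0.7135 versus 0.2520 above) arises because, without a constant, the total sum of squares is taken about zero rather than about the mean of the dependent variable. A sketch with synthetic data (not the article's sample) makes the mechanism visible:

```python
import numpy as np

# Why LSDV2's R-squared is misleading: compare the uncentered R2 that a
# no-constant fit reports with the centered R2 used by ordinary OLS.
rng = np.random.default_rng(3)
n = 40
income = rng.uniform(1000, 12000, size=n)
d1 = np.repeat([1, 0], n // 2)
d2 = 1 - d1
y = 2140 * d1 + 1134 * d2 + 0.218 * income + rng.normal(0, 300, size=n)

X = np.column_stack([income, d1, d2])  # LSDV2: no intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sse = resid @ resid

r2_uncentered = 1 - sse / (y @ y)                  # TSS about zero
r2_centered = 1 - sse / np.sum((y - y.mean())**2)  # TSS about the mean

print(f"uncentered R2: {r2_uncentered:.4f}")
print(f"centered R2:   {r2_centered:.4f}")
```

Because y is far from zero on average, the uncentered total sum of squares is much larger, so the uncentered R2 overstates fit; the centered R2 is the one comparable to LSDV1 and LSDV3.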

2.5.3 LSDV 3 with a Restriction

LSDV 3 includes the intercept and all dummies and then imposes a restriction on the model: the sum of all dummy parameters is zero. The Stata .constraint command defines a constraint, while the .cnsreg command fits a constrained OLS using the constraint() option. The number in parentheses indicates the constraint number defined in the .constraint command.

. constraint 1 d1 + d2 = 0
. cnsreg rnd income d1 d2, constraint(1)

Constrained linear regression                          Number of obs =      39
F(  2,    36) =    6.06
Prob > F      =  0.0054
Root MSE      =  1435.4
( 1)  d1 + d2 = 0
------------------------------------------------------------------------------
rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
d1 |    503.313   239.6859     2.10   0.043     17.20749    989.4184
d2 |   -503.313   239.6859    -2.10   0.043    -989.4184   -17.20749
_cons |   1636.892   310.0438     5.28   0.000     1008.094     2265.69
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income = 1,637 + 503*1 + (-503)*0 + .218*income
d2=1: R&D = 1,133.579 + .218*income = 1,637 + 503*0 + (-503)*1 + .218*income

The intercept is the average of the actual group intercepts: 1,636.892 = (2,140.205 + 1,133.579)/2. In the SAS output below, the coefficient of RESTRICT is virtually zero and, in theory, should be exactly zero.

PROC REG DATA=masil.rnd2002;
MODEL rnd = income d1 d2;
RESTRICT d1 + d2 = 0;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: rnd

NOTE: Restrictions have been applied to parameter estimates.

Number of Observations Read                         50
Number of Observations Used                         39
Number of Observations with Missing Values          11

Analysis of Variance

Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     2       24987949       12493974       6.06    0.0054
Error                    36       74175757        2060438
Corrected Total          38       99163706

Root MSE           1435.42248    R-Square     0.2520
Dependent Mean     2023.56410    Adj R-Sq     0.2104
Coeff Var            70.93536

Parameter Estimates

Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1     1636.89172      310.04381       5.28     <.0001
income        1        0.21801        0.08032       2.71     0.0101
d1            1      503.31297      239.68587       2.10     0.0428
d2            1     -503.31297      239.68587      -2.10     0.0428
RESTRICT     -1    1.81899E-12              0        .        .

* Probability computed using beta distribution.

Table 4 compares how SAS, Stata, LIMDEP, and SPSS fit the three LSDVs. SPSS is not able to fit LSDV3. In LIMDEP, b(2) in the Cls: specification indicates the parameter estimate of the second independent variable. In SPSS, pay attention to the /ORIGIN option for LSDV2.


Table 4. Estimating Three LSDVs Using SAS, Stata, LIMDEP, and SPSS