2. Least Squares Dummy Variable Regression
A dummy variable is a binary variable coded as either 1 or 0. It is commonly used to examine group and time effects in regression. Consider a simple model that regresses R&D expenditure in 2002 on net income in 2000 and firm type. The dummy variable d1 is set to 1 for equipment and software firms and 0 for telecommunication and electronics firms. The variable d2 is coded in the opposite way. Take a look at the data structure (Figure 2).
Figure 2. Dummy Variable Coding for Firm Type
+-----------------------------------------------------------------+
| firm             rnd    income   type               d1   d2     |
|-----------------------------------------------------------------|
| Samsung        2,500     4,768   Electronics         0    1     |
| AT&T             254     4,669   Telecom             0    1     |
| IBM            4,750     8,093   IT Equipment        1    0     |
| Siemens        5,490     6,528   Electronics         0    1     |
| Verizon            .    11,797   Telecom             0    1     |
| Microsoft      3,772     9,421   Service & S/W       1    0     |
| ...              ...       ...   ...               ...  ...     |
+-----------------------------------------------------------------+
2.1 Model 1 without a Dummy Variable
The ordinary least squares (OLS) regression without dummy variables, a pooled regression model, assumes a constant intercept and slope regardless of firm types. In the following regression equation, beta0 is the intercept; beta1 is the slope of net income in 2000; and e is the error term.
Model 1: R&D = beta0 + beta1*income + e
The pooled model has an intercept of 1,482.697 and a slope of .223. For a $1 million increase in net income, a firm is likely to increase R&D expenditure in 2002 by $.223 million.
. regress rnd income
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  1,    37) =    7.07
       Model |  15902406.5     1  15902406.5           Prob > F      =  0.0115
    Residual |  83261299.1    37  2250305.38           R-squared     =  0.1604
-------------+------------------------------           Adj R-squared =  0.1377
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.1

------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2230523   .0839066     2.66   0.012     .0530414    .3930632
       _cons |   1482.697   314.7957     4.71   0.000     844.8599    2120.533
------------------------------------------------------------------------------
Pooled model: R&D = 1,482.697 + .223*income
Although goodness-of-fit statistics such as F and t are moderate, this is a naive model because R&D investment tends to vary across industries.
2.2 Model 2 with a Dummy Variable
You may assume that equipment and software firms have more R&D expenditure than other types of companies. Let us take this group difference into account. We have to drop one of the two dummy variables in order to avoid perfect multicollinearity; OLS does not work with an intercept and both dummies in the same model. The delta1 in Model 2 is the coefficient that applies to equipment and software companies only.
Model 2: R&D = beta0 + beta1*income + delta1*d1 + e
Unlike Model 1, this model results in two different regression equations for two groups. The difference lies in the intercepts, but the slope remains unchanged.
. regress rnd income d1
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d1 |   1006.626   479.3717     2.10   0.043     34.41498    1978.837
       _cons |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income = 1,133.579 + 1,006.626*1 + .218*income
d1=0: R&D = 1,133.579 + .218*income = 1,133.579 + 1,006.626*0 + .218*income
The slope .218 indicates a positive impact of two-year-lagged net income on a firm’s R&D expenditure. Equipment and software firms on average spend $1,007 million more on R&D than telecommunication and electronics companies.
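The two group equations follow directly from the Model 2 coefficients. As a minimal sketch in Python (the function and variable names are illustrative and not part of the original analysis), each group's intercept is the constant plus the dummy coefficient times the dummy value:

```python
# Coefficients copied from the Stata output for Model 2.
b_cons = 1133.579      # _cons: intercept of the baseline group (d1 = 0)
b_income = 0.2180066   # common slope of 2000 net income
b_d1 = 1006.626        # intercept shift for equipment and software firms


def group_intercept(d1):
    """Intercept implied by Model 2 for a firm with dummy value d1 (0 or 1)."""
    return b_cons + b_d1 * d1


def predict_rnd(income, d1):
    """Predicted 2002 R&D ($ million) from 2000 net income and firm type."""
    return group_intercept(d1) + b_income * income


print(round(group_intercept(1), 3))  # 2140.205
print(round(group_intercept(0), 3))  # 1133.579
```

Because the slope is shared, the gap between the two predicted values is the same 1,006.626 at every income level.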
2.3 Visualization of Model 1 and 2
There is only a tiny difference in the slope (.223 versus .218) between Model 1 and Model 2. The intercept 1,483 of Model 1, however, is quite different from 2,140 for equipment and software companies and 1,134 for telecommunication and electronics firms in Model 2. This result appears to support Model 2.
Figure 3 highlights differences between Model 1 and 2 more clearly. The black line (pooled) in the middle is the regression line of Model 1; the red line at the top is one for equipment and software companies (d1=1) in Model 2; finally the blue line at the bottom is for telecommunication and electronics firms (d2=1 or d1=0).
Figure 3. Regression Lines of Model 1 and Model 2
This plot shows that Model 1 ignores the group difference and thus reports a misleading intercept. The difference in the intercept between the two groups of firms looks substantial, while the two models have similar slopes. Consequently, Model 2, which considers fixed group effects, seems better than the simple Model 1. Compare goodness-of-fit statistics (e.g., F, t, R2, and SSE) of the two models. See Sections 3.2.2 and 4.7 for formal hypothesis testing.
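One hedged way to make that comparison concrete is an incremental F test built from the residual sums of squares reported for the two models. The sketch below is only an illustration in Python, not the formal procedure of the later sections, and the function name is hypothetical.

```python
import math


def incremental_f(sse_restricted, sse_unrestricted, num_restrictions, df_unrestricted):
    """F statistic for adding regressors to a nested OLS model."""
    numerator = (sse_restricted - sse_unrestricted) / num_restrictions
    denominator = sse_unrestricted / df_unrestricted
    return numerator / denominator


# Residual SS from the Stata outputs: Model 1 (pooled) and Model 2 (with d1).
sse_model1 = 83261299.1   # 37 residual degrees of freedom
sse_model2 = 74175756.7   # 36 residual degrees of freedom

f_stat = incremental_f(sse_model1, sse_model2, num_restrictions=1, df_unrestricted=36)
print(round(f_stat, 2))             # 4.41
print(round(math.sqrt(f_stat), 2))  # 2.1, matching the t statistic of d1 (2.10)
```

With a single added dummy, the F statistic is the square of the t statistic on d1, so it carries the same p-value (.043) reported in the Model 2 output.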
2.4 Alternatives to LSDV1
The least squares dummy variable (LSDV) regression is ordinary least squares (OLS) with dummy variables. The critical issue in LSDV is how to avoid the perfect multicollinearity or the so-called “dummy variable trap.” LSDV has three approaches to avoid getting caught in the trap. They produce different parameter estimates of dummies, but their results are equivalent.
The first approach, LSDV1, drops a dummy variable as in Model 2 above. The second approach includes all dummies and, in turn, suppresses the intercept (LSDV2). Finally, include the intercept and all dummies, and then impose a restriction that the sum of parameters of all dummies is zero (LSDV3). Take a look at the following functional forms to compare these three LSDVs.
LSDV1: R&D = beta0 + beta1*income + delta1*d1 + e or R&D = beta0 + beta1*income + delta2*d2 + e
LSDV2: R&D = beta1*income + delta1*d1 + delta2*d2 + e
LSDV3: R&D = beta0 + beta1*income + delta1*d1 + delta2*d2 + e, subject to delta1 + delta2 = 0
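The trap that these forms avoid can be illustrated with a small exact calculation. The following Python sketch uses hypothetical toy data (four made-up firms, not the R&D data set) and helper functions written only for this illustration. It shows that the cross-product matrix X'X is singular when the intercept and both dummies enter together, and nonsingular once either a dummy or the intercept is dropped.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)       # no pivot: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det               # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return det


def cross_product(columns):
    """X'X for a design matrix supplied as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


# Hypothetical toy data: two firms in each group.
income = [4, 7, 5, 9]
d1 = [1, 1, 0, 0]
d2 = [1 - v for v in d1]     # d2 is coded the opposite way
const = [1] * len(income)

det_trap = determinant(cross_product([const, income, d1, d2]))  # intercept and both dummies
det_lsdv1 = determinant(cross_product([const, income, d1]))     # one dummy dropped
det_lsdv2 = determinant(cross_product([income, d1, d2]))        # intercept suppressed

print(det_trap)                        # 0
print(det_lsdv1 != 0, det_lsdv2 != 0)  # True True
```

The zero determinant arises because d1 + d2 reproduces the intercept column exactly, so the normal equations have no unique solution. The third form keeps all four columns and instead restores a unique solution through the restriction on the dummy parameters.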
The main differences among these approaches exist in the meanings of the dummy variable parameters. Each approach defines the coefficients of dummy variables in different ways (Table 3). The parameter estimates in LSDV2 are actual intercepts of groups, making it easy to interpret substantively. LSDV1 reports differences from the reference point (dropped dummy variable). LSDV3 computes how far parameter estimates are away from the average group effect. Accordingly, null hypotheses of t-tests in the three approaches are different. Keep in mind that the R2 of LSDV2 is not correct. Table 3 contrasts the three LSDVs.
Table 3. Three Approaches of Least Squares Dummy Variable Models
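The relationships among the three sets of dummy coefficients can be sketched from the two group intercepts already obtained (2,140.205 and 1,133.579). The Python function below is a hypothetical helper written for this illustration and is not part of any package discussed here; the choice of d2 as the LSDV1 baseline is an assumption made for the example.

```python
def lsdv_parameterizations(group_intercepts, baseline):
    """Express group intercepts in the LSDV1, LSDV2, and LSDV3 forms."""
    names = list(group_intercepts)
    mean_effect = sum(group_intercepts.values()) / len(names)
    base = group_intercepts[baseline]
    return {
        # LSDV1: the intercept is the baseline group; dummies are differences from it.
        "LSDV1": {"_cons": base,
                  **{g: group_intercepts[g] - base for g in names if g != baseline}},
        # LSDV2: no intercept; each dummy coefficient is its group's own intercept.
        "LSDV2": dict(group_intercepts),
        # LSDV3: the intercept is the average; dummies are deviations that sum to zero.
        "LSDV3": {"_cons": mean_effect,
                  **{g: group_intercepts[g] - mean_effect for g in names}},
    }


forms = lsdv_parameterizations({"d1": 2140.205, "d2": 1133.579}, baseline="d2")
for name, params in forms.items():
    print(name, {k: round(v, 3) for k, v in params.items()})
```

All three forms describe the same pair of regression lines; only the reference point for the dummy coefficients changes, which is why the associated t-tests answer different questions.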
2.5 Estimating Three LSDVs
The SAS REG procedure, Stata .regress command, LIMDEP Regress$ command, and SPSS Regression command all fit OLS and LSDVs. Let us estimate three LSDVs using SAS and Stata.
2.5.1 LSDV 1 without a Dummy
LSDV 1 drops a dummy variable. The intercept is the actual parameter estimate of the dropped dummy variable. The coefficient of the dummy included means how far its parameter estimate is away from the reference point or baseline (i.e., the intercept).
Here we include d2 instead of d1 to see how a different reference point changes the result. Check the sign of the coefficient of the included dummy and the intercept. Which dummy is dropped changes only the reference point, not the fitted model.
PROC REG DATA=masil.rnd2002;
MODEL rnd = income d2;
RUN;
                             The REG Procedure
                               Model: MODEL1
                          Dependent Variable: rnd

          Number of Observations Read                          50
          Number of Observations Used                          39
          Number of Observations with Missing Values           11

                            Analysis of Variance

                                   Sum of           Mean
Source                   DF       Squares         Square    F Value    Pr > F
Model                     2      24987949       12493974       6.06    0.0054
Error                    36      74175757        2060438
Corrected Total          38      99163706

Root MSE             1435.42248    R-Square     0.2520
Dependent Mean       2023.56410    Adj R-Sq     0.2104
Coeff Var              70.93536

                            Parameter Estimates

                        Parameter       Standard
Variable       DF        Estimate          Error    t Value    Pr > |t|
Intercept       1      2140.20468      434.48460       4.93      <.0001
income          1         0.21801        0.08032       2.71      0.0101
d2              1     -1006.62593      479.37174      -2.10      0.0428
d2=0: R&D = 2,140.205 + .218*income = 2,140.205 - 1,006.626*0 + .218*income
d2=1: R&D = 1,133.579 + .218*income = 2,140.205 - 1,006.626*1 + .218*income
2.5.2 LSDV 2 without the Intercept
LSDV 2 includes all dummy variables and suppresses the intercept. The Stata .regress command has the noconstant option to fit LSDV2. The coefficients of the dummies are the actual group intercepts; thus, you do not need to compute them. This LSDV, however, reports an incorrect R2.
. regress rnd income d1 d2, noconstant
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =   29.88
       Model |   184685604     3  61561868.1           Prob > F      =  0.0000
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.7135
-------------+------------------------------           Adj R-squared =  0.6896
       Total |   258861361    39  6637470.79           Root MSE      =  1435.4

------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d1 |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
          d2 |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income
d2=1: R&D = 1,133.579 + .218*income
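Why the R2 of .7135 is not comparable with the .2520 from LSDV1 can be checked from the reported sums of squares. The following is a minimal Python sketch, assuming that the no-intercept fit measures total variation around zero rather than around the mean, which is consistent with the 39 total degrees of freedom in the output.

```python
sse = 74175756.7            # residual SS, identical in LSDV1 and LSDV2
tss_centered = 99163705.6   # total SS around the mean (LSDV1 output, 38 df)
tss_uncentered = 258861361  # total SS around zero (LSDV2 output, 39 df)
n_obs = 39
mean_rnd = 2023.56410       # "Dependent Mean" from the SAS output

r2_centered = 1 - sse / tss_centered
r2_uncentered = 1 - sse / tss_uncentered

# The two totals differ by n times the squared mean of the dependent variable.
gap_ratio = (tss_uncentered - tss_centered) / (n_obs * mean_rnd ** 2)

print(round(r2_centered, 4))    # 0.252
print(round(r2_uncentered, 4))  # 0.7135
print(round(gap_ratio, 4))      # 1.0
```

The residual sum of squares and root MSE are unchanged, so the fit itself is identical to LSDV1; only the denominator of R2 differs.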
2.5.3 LSDV 3 with a Restriction
LSDV 3 includes the intercept and all dummies and then imposes a restriction on the model: the sum of all dummy parameters is zero. The Stata .constraint command defines a constraint, while the .cnsreg command fits a constrained OLS using the constraint() option. The number in parentheses indicates the constraint number defined in the .constraint command.
. constraint 1 d1 + d2 = 0
. cnsreg rnd income d1 d2, constraint(1)
Constrained linear regression                          Number of obs =      39
                                                       F(  2,    36) =    6.06
                                                       Prob > F      =  0.0054
                                                       Root MSE      =  1435.4
 ( 1)  d1 + d2 = 0
------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d1 |    503.313   239.6859     2.10   0.043     17.20749    989.4184
          d2 |   -503.313   239.6859    -2.10   0.043    -989.4184   -17.20749
       _cons |   1636.892   310.0438     5.28   0.000     1008.094     2265.69
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income = 1,637 + 503*1 + (-503)*0 + .218*income
d2=1: R&D = 1,133.579 + .218*income = 1,637 + 503*0 + (-503)*1 + .218*income
The intercept is the average of the two group intercepts: 1,637 = (2,140 + 1,134)/2. In the SAS output below, the coefficient of RESTRICT is virtually zero and, in theory, should be zero.
PROC REG DATA=masil.rnd2002;
MODEL rnd = income d1 d2;
RESTRICT d1 + d2 = 0;
RUN;
                             The REG Procedure
                               Model: MODEL1
                          Dependent Variable: rnd

NOTE: Restrictions have been applied to parameter estimates.

          Number of Observations Read                          50
          Number of Observations Used                          39
          Number of Observations with Missing Values           11

                            Analysis of Variance

                                   Sum of           Mean
Source                   DF       Squares         Square    F Value    Pr > F
Model                     2      24987949       12493974       6.06    0.0054
Error                    36      74175757        2060438
Corrected Total          38      99163706

Root MSE             1435.42248    R-Square     0.2520
Dependent Mean       2023.56410    Adj R-Sq     0.2104
Coeff Var              70.93536

                            Parameter Estimates

                        Parameter       Standard
Variable       DF        Estimate          Error    t Value    Pr > |t|
Intercept       1      1636.89172      310.04381       5.28      <.0001
income          1         0.21801        0.08032       2.71      0.0101
d1              1       503.31297      239.68587       2.10      0.0428
d2              1      -503.31297      239.68587      -2.10      0.0428
RESTRICT       -1     1.81899E-12              0        .         .
* Probability computed using beta distribution.
Table 4 compares how SAS, Stata, LIMDEP, and SPSS conduct LSDVs. SPSS is not able to fit LSDV3. In LIMDEP, the b(2) in Cls: indicates the parameter estimate of the second independent variable. In SPSS, pay attention to the /ORIGIN option for LSDV2.
Table 4. Estimating Three LSDVs Using SAS, Stata, LIMDEP, and SPSS