7. Comparing the Proportions of Binary Variables
When the left-hand side variable to be compared is binary, its mean is the same as the proportion or the percentage of success. The data generation process (DGP) follows the binomial distribution with two outcomes (i.e., 0 and 1) and a constant probability p of success in each independent trial.
In general, when np >= 5 and n(1-p) >= 5, the binomial distribution can be approximated by the normal probability distribution. Therefore, we can compare proportions using the properties of these probability distributions. However, if the probability is extremely small or large, the z test becomes less accurate (Hildebrand et al. 2005: 332, 388).
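The rule of thumb above can be checked mechanically before running a z test. The following is a minimal Python sketch, not part of the original SAS/Stata workflow; the function name `normal_approx_ok` and the `threshold` parameter are hypothetical names chosen for illustration.

```python
def normal_approx_ok(n, p, threshold=5):
    """Return True when both expected counts, n*p and n*(1-p), reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


# n = 30 with p = .5 gives expected counts of 15 and 15
print(normal_approx_ok(30, 0.5))
# n = 30 with p = .05 gives only 1.5 expected successes
print(normal_approx_ok(30, 0.05))
```

When the check fails, an exact binomial test is the safer choice.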
7.1 Comparing a Proportion with a Hypothesized Proportion
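Suppose n measurements of a binary variable y were taken from the binomial distribution with a probability of success p. To compare the sample proportion with a hypothesized proportion, compute the z score for the standard normal distribution as follows (Hildebrand et al. 2005; Stata 2007; Bluman 2008).
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} \sim N(0, 1)$$
where \hat{p} is the proportion of success in the sample, p_0 is the hypothesized proportion, and n is the number of observations.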
Let us compare the proportions of y1 and y2 using SAS and Stata. Their proportions (means) are .6667 (=20/30) and .3333 (10/30), respectively.
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          y1 |        30    .6666667    .4794633          0          1
          y2 |        30    .3333333    .4794633          0          1
We are interested in whether the population proportion of y1 is .5. The test statistic z is 1.8257 = (2/3-.5) / sqrt(.5*.5/30) and the corresponding two-sided p-value is .0679. Therefore, the null hypothesis is not rejected at the .05 level.
In Stata, use the .prtest command followed by a binary variable and the hypothesized proportion separated by an equal sign.
. prtest y1=.5
------------------------------------------------------------------------------
Variable | Mean Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 | .6666667 .0860663 .4979798 .8353535
------------------------------------------------------------------------------
p = proportion(y1) z = 1.8257
Ho: p = 0.5
Ha: p < 0.5 Ha: p != 0.5 Ha: p > 0.5
Pr(Z < z) = 0.9661 Pr(|Z| > |z|) = 0.0679 Pr(Z > z) = 0.0339
The z score of 1.8257 above is not sufficiently large to reject the null hypothesis at the .05 level (p = .0679); the data are consistent with a population proportion of .5 for y1. The 95 percent confidence interval is .6667 ± 1.96 * sqrt((.6667*(1-.6667))/30).
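The arithmetic behind this test can be reproduced by hand. The following is a minimal Python sketch using only the standard library; the function name `one_sample_prtest` is hypothetical and is not a Stata or SAS routine. It assumes, as the numbers above indicate, that the z score uses the hypothesized proportion in its standard error while the confidence interval uses the sample proportion.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_prtest(successes, n, p0, level=0.95):
    """Large-sample z test of a proportion against the hypothesized value p0."""
    p_hat = successes / n
    # The test statistic uses the standard error under the null hypothesis
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    # The confidence interval uses the standard error at the sample proportion
    se_hat = sqrt(p_hat * (1 - p_hat) / n)
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z, p_two_sided, (p_hat - crit * se_hat, p_hat + crit * se_hat)


# y1: 20 successes out of 30, tested against p0 = .5
z, p_value, ci = one_sample_prtest(20, 30, 0.5)
print(round(z, 4), round(p_value, 4), [round(b, 4) for b in ci])
```

The values should agree with the .prtest output above up to rounding.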
If you have aggregated information only, use the .prtesti command, which is the immediate form of .prtest. This command is followed by the number of observations, the proportion of success in the sample, and the hypothesized proportion.
. prtesti 30 .6667 .5
(output is skipped)
In SAS, you need to take the point-and-click approach to compare proportions since SAS does not have a procedure to do this task in an easy manner. Click Solution--> Analyst--> Statistics--> Hypothesis Tests--> One-Sample Test for a Proportion. You are asked to choose the category of success or the level of interest (0 or 1).
Sample Statistics
y1 Frequency
------------------------
0 10
1 20
---------
Total 30
Hypothesis Test
Null Hypothesis: Proportion = 0.5
Alternative: Proportion ^= 0.5
y1 Proportion Z Statistic Pr > Z
---------------------------------------------------
1 0.6667 1.83 0.0679
In SAS and Stata, the test is based on large-sample theory. If you have a small sample, conduct the binomial probability test using the .bitest (or .bitesti) command in Stata (Stata 2007). The two-sided p-value of .0987 below is somewhat larger than the .0679 above.
(output is skipped)
. bitesti 30 .3333 .5
        N   Observed k   Expected k   Assumed p   Observed p
------------------------------------------------------------
       30         10           15       0.50000      0.33333
Pr(k >= 10) = 0.978613 (one-sided test)
Pr(k <= 10) = 0.049369 (one-sided test)
Pr(k <= 10 or k >= 20) = 0.098737 (two-sided test)
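The exact probabilities above can be recomputed directly from the binomial probability mass function. The following is a minimal Python sketch; the function names are hypothetical. For the two-sided p-value it assumes one common definition, namely summing the probabilities of all outcomes that are no more likely than the observed one, which reproduces the figures shown above for this symmetric case.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_test(k_obs, n, p0):
    """Return Pr(k >= k_obs), Pr(k <= k_obs), and a two-sided p-value."""
    pmf = [binom_pmf(k, n, p0) for k in range(n + 1)]
    upper = sum(pmf[k_obs:])
    lower = sum(pmf[: k_obs + 1])
    # Small relative tolerance so that ties survive floating-point rounding
    cutoff = pmf[k_obs] * (1 + 1e-9)
    two_sided = sum(q for q in pmf if q <= cutoff)
    return upper, lower, two_sided


# 10 successes out of 30, tested against p0 = .5
upper, lower, two_sided = exact_binomial_test(10, 30, 0.5)
print(round(upper, 6), round(lower, 6), round(two_sided, 6))
```

Because the null distribution is symmetric at p = .5, the two-sided value is simply twice the lower tail.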
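7.2 Comparing Two Proportions
If you wish to compare two proportions, apply the following formula. The pooled (weighted) proportion is used under the null hypothesis of equal proportions, p1=p2.
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{pooled} (1 - \hat{p}_{pooled}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}, \qquad \hat{p}_{pooled} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$
where \hat{p}_1 and \hat{p}_2 are the sample proportions and n_1 and n_2 are the numbers of observations of the two samples.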
In Stata, the .prtest command enables you to use both types of data arrangement illustrated in Figure 3. If you have a data set arranged in the second type, list the two variables separated by an equal sign.
. prtest y1=y2
y2: Number of obs = 30
------------------------------------------------------------------------------
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1 | .6666667 .0860663 .4979798 .8353535
y2 | .3333333 .0860663 .1646465 .5020202
-------------+----------------------------------------------------------------
diff | .3333333 .1217161 .0947741 .5718926
| under Ho: .1290994 2.58 0.010
------------------------------------------------------------------------------
diff = prop(y1) - prop(y2) z = 2.5820
Ho: diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(Z < z) = 0.9951 Pr(|Z| > |z|) = 0.0098 Pr(Z > z) = 0.0049
The pooled proportion is .5 = (20+10)/(30+30). The z score is 2.5820 = (2/3-1/3) / sqrt(.5*.5*(1/30+1/30)) and its two-sided p-value is .0098. Accordingly, you may reject the null hypothesis of equal proportions at the .05 level; the two population proportions are different. The 95 percent confidence interval is .3333 ± 1.96 * sqrt(.6667*(1-.6667)/30 + .3333*(1-.3333)/30).
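The pooled test and its confidence interval can be verified by hand. The following is a minimal Python sketch using only the standard library; the function name `two_sample_prtest` is hypothetical. It follows the computation described above: the z score uses the pooled standard error under the null hypothesis, and the confidence interval uses the unpooled standard error of the difference.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_prtest(x1, n1, x2, n2, level=0.95):
    """Pooled large-sample z test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Pooled proportion under the null hypothesis p1 = p2
    pooled = (x1 + x2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_null
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval of p1 - p2
    se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z, p_two_sided, (diff - crit * se_diff, diff + crit * se_diff)


# y1: 20 of 30 successes; y2: 10 of 30 successes
z, p_value, ci = two_sample_prtest(20, 30, 10, 30)
print(round(z, 4), round(p_value, 4), [round(b, 4) for b in ci])
```

The values should agree with the .prtest output above up to rounding.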
If the data set is arranged in the first type, run the following command.
. prtest y1, by(group)
(output is skipped)
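Alternatively, you may use the following formula, which does not pool the two proportions (Hildebrand et al. 2005: 386-388). Note that its denominator is used to construct the confidence interval of p1-p2.
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}}$$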
This formula returns 2.7386 = (2/3-1/3) / sqrt(.6667*(1-.6667)/30 + .3333*(1-.3333)/30), which is slightly larger than 2.5820 above. We can reject the null hypothesis of equal proportions (p = .0062).
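The unpooled z score can be recomputed as follows. This is a minimal Python sketch with a hypothetical function name, `unpooled_z`; it takes the two sample proportions and sample sizes as inputs.

```python
from math import sqrt
from statistics import NormalDist


def unpooled_z(p1, n1, p2, n2):
    """z score for p1 - p2 using the unpooled standard error."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se


z_unpooled = unpooled_z(2 / 3, 30, 1 / 3, 30)
p_unpooled = 2 * (1 - NormalDist().cdf(abs(z_unpooled)))
print(round(z_unpooled, 4), round(p_unpooled, 4))
```

In this data set the unpooled standard error is smaller than the pooled one, because p(1-p) is largest at the pooled value of .5, so this z score is larger.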
SAS produces the same result as the pooled test in Stata (z = 2.58, p = .0098). You need to select Solution--> Analyst--> Statistics--> Hypothesis Tests--> Two-Sample Test for a Proportion.
Sample Statistics
- Frequencies of -
Value v1 v2
----------------------------------------
0 10 20
1 20 10
Hypothesis Test
Null hypothesis: Proportion of v1 - Proportion of v2 = 0
Alternative: Proportion of v1 - Proportion of v2 ^= 0
- Proportions of -
Value v1 v2 Z Prob > Z
------------------------------------------------------------
1 0.6667 0.3333 2.58 0.0098
If you have aggregated information only, use the .prtesti command followed by the number of observations and the proportion of success for each of the two samples in turn.
. prtesti 30 .6667 30 .3333
(output is skipped)
7.3 Comparing Means versus Comparing Proportions
Now, you may ask yourself: "What if I conduct the t-test to compare means of two binary variables?" or "What is the advantage of comparing proportions over comparing means (t-test)?" The simple answer is that there is no big difference when the sample size is large. The only difference between comparing means and comparing proportions comes from the computation of the denominators in the formulas, and this difference becomes smaller as the sample size increases. If N is sufficiently large, both the t distribution and the binomial distribution are well approximated by the normal distribution.
Let us perform the independent sample t-test on the same data and check the difference. The unpaired option indicates that the two samples are not paired but independent of each other.
. ttest y1=y2, unpaired
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
y1 | 30 .6666667 .0875376 .4794633 .4876321 .8457012
y2 | 30 .3333333 .0875376 .4794633 .1542988 .5123679
---------+--------------------------------------------------------------------
combined | 60 .5 .0650945 .5042195 .3697463 .6302537
---------+--------------------------------------------------------------------
diff | .3333333 .1237969 .0855269 .5811397
------------------------------------------------------------------------------
diff = mean(y1) - mean(y2) t = 2.6926
Ho: diff = 0 degrees of freedom = 58
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.9954 Pr(|T| > |t|) = 0.0093 Pr(T > t) = 0.0046
The t of 2.6926 is similar to the z score of 2.5820. Their p-values are .0093 and .0098, respectively; the null hypothesis is rejected in both tests.
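The t statistic can be reproduced from the counts alone, which makes the role of the denominator visible. The following is a minimal Python sketch with a hypothetical function name, `pooled_t`. It relies on the fact that the sample variance of a 0/1 variable is n/(n-1) * p(1-p). The p-value is omitted because the t distribution is not available in the Python standard library.

```python
from math import sqrt


def pooled_t(x1, n1, x2, n2):
    """Equal-variance two-sample t statistic for two 0/1 variables."""
    p1, p2 = x1 / n1, x2 / n2
    # Sample variances of binary variables (divisor n - 1)
    s1_sq = n1 / (n1 - 1) * p1 * (1 - p1)
    s2_sq = n2 / (n2 - 1) * p2 * (1 - p2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, df, se


# y1: 20 of 30 successes; y2: 10 of 30 successes
t, df, se = pooled_t(20, 30, 10, 30)
print(round(t, 4), df, round(se, 7))
```

With equal sample sizes, this standard error divides p(1-p) by n-1 where the unpooled z score divides by n, so the two statistics differ only by a factor of sqrt((n-1)/n), which approaches 1 as n grows.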
Table 6 suggests that the difference between comparing means (t-test) and comparing proportions (z-test) becomes negligible as N becomes larger. The random variable a was drawn from RAND('BERNOULLI', .50) in SAS, which is the random number generator for the Bernoulli distribution with a probability of .50. Similarly, the variable b was generated from RAND('BERNOULLI', .55). Roughly speaking, the p-values of t and z become almost the same once the sample size exceeds 30.