7. Comparing the Proportions of Binary Variables

When the left-hand side variable to be compared is binary, its mean equals the proportion (or percentage) of successes. The data generation process (DGP) follows the binomial distribution: each observation takes one of two outcomes (0 or 1), and each trial succeeds independently with probability p.

As a rule of thumb, when np >= 5 and n(1-p) >= 5, the binomial distribution is well approximated by the normal probability distribution. Therefore, we can compare proportions using the properties of these probability distributions. However, if the probability is extremely small or large, the z test becomes less accurate (Hildebrand et al. 2005: 332, 388).
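The rule of thumb above can be checked mechanically before running a z test. The following is a minimal Python sketch, not part of the SAS or Stata workflow; the function name normal_approx_ok and the threshold argument are illustrative assumptions.

```python
def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb from the text: n*p >= 5 and n*(1-p) >= 5.

    n is the number of observations and p the probability of success.
    The function name and the `threshold` parameter are hypothetical.
    """
    return n * p >= threshold and n * (1 - p) >= threshold


# n = 30 with p = .5 gives n*p = n*(1-p) = 15, so the rule is satisfied.
print(normal_approx_ok(30, 0.5))
# n = 30 with p = .05 gives n*p = 1.5, so the rule is not satisfied.
print(normal_approx_ok(30, 0.05))
```

When the check fails, an exact binomial test is usually a safer choice than the z test.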

7.1 Comparing a Proportion with a Hypothesized Proportion

Suppose n measurements of a binary variable y were taken from the binomial distribution with a probability of success p. If you want to compare the proportion with the hypothesized proportion, compute the z score for the standard normal distribution as follows (Hildebrand et al. 2005; Stata 2007; Bluman 2008).

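$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$$

where \hat{p} is the sample proportion, p_0 is the hypothesized proportion, and n is the number of observations.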

Let us compare the proportions of y1 and y2 using SAS and Stata. Their proportions (means) are .6667 (=20/30) and .3333 (=10/30), respectively.

. sum

    Variable |       Obs        Mean   Std. Dev.      Min        Max
-------------+--------------------------------------------------------
          y1 |        30    .6666667   .4794633         0          1
          y2 |        30    .3333333   .4794633         0          1

We are interested in whether the population proportion of y1 is .5. The test statistic z is 1.8257 = (2/3-.5) / sqrt(.5*.5/30) and the corresponding two-sided p-value is .0679. Because .0679 is larger than .05, the null hypothesis is not rejected at the .05 level.
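The arithmetic above can be verified by hand. The following is a minimal Python sketch that uses only the standard library; the function name one_sample_prop_z is an illustrative assumption, and the two-sided p-value is obtained from the standard normal distribution through math.erfc.

```python
import math


def one_sample_prop_z(successes, n, p0):
    """Large-sample z test of a proportion against a hypothesized value p0.

    Hypothetical helper: returns the z score and the two-sided p-value.
    """
    p_hat = successes / n
    # Standard error under the null hypothesis uses p0, not p_hat.
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    # Two-sided tail area of the standard normal: P(|Z| > |z|).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# y1 has 20 successes out of 30 observations; the hypothesized value is .5.
z, p = one_sample_prop_z(20, 30, 0.5)
print(f"z = {z:.4f}, two-sided p = {p:.4f}")
```

The printed values should agree with the z score and p-value quoted in the text.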

In Stata, use the .prtest command followed by a binary variable and the hypothesized proportion separated by an equal sign.

. prtest y1=.5
One-sample test of proportion                     y1: Number of obs =       30
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          y1 |   .6666667   .0860663                      .4979798    .8353535
------------------------------------------------------------------------------
    p = proportion(y1)                                            z =   1.8257
Ho: p = 0.5
 
     Ha: p < 0.5                 Ha: p != 0.5                   Ha: p > 0.5
 Pr(Z < z) = 0.9661         Pr(|Z| > |z|) = 0.0679          Pr(Z > z) = 0.0339

The z score of 1.8257 above is not sufficiently large to reject the null hypothesis (p=.0679). We cannot conclude that the population proportion of y1 differs from .5. The 95 percent confidence interval is .6667 ± 1.96 * sqrt((.6667*(1-.6667))/30).
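The confidence interval formula above can be evaluated directly. The following is a minimal Python sketch; the function name prop_ci is an illustrative assumption, and it uses the rounded critical value 1.96 from the text, so the last digits may differ slightly from the Stata output.

```python
import math


def prop_ci(successes, n, z_crit=1.96):
    """Large-sample confidence interval for a single proportion.

    Hypothetical helper: unlike the test statistic, the interval uses the
    sample proportion (not the hypothesized value) in the standard error.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_crit * se, p_hat + z_crit * se


# y1: 20 successes out of 30 observations.
lower, upper = prop_ci(20, 30)
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

The interval contains .5, which is consistent with the two-sided test at the .05 level.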

If you have aggregated information only, use the .prtesti command, which is the immediate form of .prtest. This command is followed by the number of observations, the sample proportion, and the hypothesized proportion.

. prtesti 30 .3333 .5
(output is skipped)

In SAS, you need to take the point-and-click approach to compare proportions since SAS does not have a procedure to do this task in an easy manner. Click Solution--> Analyst--> Statistics--> Hypothesis Tests--> One-Sample Test for a Proportion. You are asked to choose the category of success or the level of interest (0 or 1).

                               One Sample Test of a Proportion                                

   Sample Statistics

        y1             Frequency
        ------------------------
        0                    10
        1                    20
                       ---------
        Total                30

   Hypothesis Test

        Null Hypothesis:    Proportion =  0.5
        Alternative:        Proportion ^= 0.5

        y1             Proportion     Z Statistic    Pr > Z
        ---------------------------------------------------
        1                0.6667          1.83        0.0679

In SAS and Stata, the test is based on large-sample theory. If you have a small sample, you need to conduct the binomial probability test using the .bitest (or .bitesti) command in Stata (Stata 2007). The two-sided p-value .0987 below is slightly larger than .0679 above.

. bitest y1=.5
(output is skipped)

. bitesti 30 .3333 .5
        N   Observed k   Expected k   Assumed p   Observed p
------------------------------------------------------------
       30         10           15       0.50000      0.33333
 
  Pr(k >= 10)            = 0.978613  (one-sided test)
  Pr(k <= 10)            = 0.049369  (one-sided test)
  Pr(k <= 10 or k >= 20) = 0.098737  (two-sided test)
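The exact binomial probabilities above can be reproduced without the normal approximation. The following is a minimal Python sketch; the function names are illustrative assumptions. The two-sided value is formed by summing the probabilities of all outcomes that are no more likely than the observed count, which is an assumption about the convention used, although it reproduces the two-sided value in the output above.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_test(k, n, p0):
    """Hypothetical helper returning Pr(K >= k), Pr(K <= k), and a two-sided p."""
    probs = [binom_pmf(i, n, p0) for i in range(n + 1)]
    upper = sum(probs[k:])        # Pr(K >= k)
    lower = sum(probs[: k + 1])   # Pr(K <= k)
    # Assumed two-sided rule: add up every outcome whose probability does not
    # exceed that of the observed count (small tolerance for rounding error).
    cutoff = probs[k] * (1 + 1e-7)
    two_sided = sum(pr for pr in probs if pr <= cutoff)
    return upper, lower, two_sided


# 10 successes out of 30 observations, hypothesized proportion .5.
upper, lower, two_sided = exact_binomial_test(10, 30, 0.5)
print(f"Pr(k >= 10) = {upper:.6f}")
print(f"Pr(k <= 10) = {lower:.6f}")
print(f"two-sided   = {two_sided:.6f}")
```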


7.2 Comparing Two Proportions

If you wish to compare two proportions, apply the following formula. The pooled (weighted) proportion is used under the null hypothesis of equal proportions, p1=p2.

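$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{pooled} (1 - \hat{p}_{pooled}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}, \qquad \hat{p}_{pooled} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

where \hat{p}_1 and \hat{p}_2 are the sample proportions and n_1 and n_2 are the numbers of observations in the two samples.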

In Stata, the .prtest command enables you to use both types of data arrangement illustrated in Figure 3. If you have a data set arranged in the second type, list the two variables separated by an equal sign.

. prtest y1=y2

Two-sample test of proportion                     y1: Number of obs =       30
                                                  y2: Number of obs =       30
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          y1 |   .6666667   .0860663                      .4979798    .8353535
          y2 |   .3333333   .0860663                      .1646465    .5020202
-------------+----------------------------------------------------------------
        diff |   .3333333   .1217161                      .0947741    .5718926
             |  under Ho:   .1290994     2.58   0.010
------------------------------------------------------------------------------
        diff = prop(y1) - prop(y2)                                z =   2.5820
    Ho: diff = 0
 
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9951         Pr(|Z| > |z|) = 0.0098          Pr(Z > z) = 0.0049

The pooled proportion is .5 = (20+10)/(30+30). The z score is 2.5820 = (2/3-1/3) / sqrt(.5*.5*(1/30+1/30)) and its two-sided p-value is .0098. Accordingly, you may reject the null hypothesis of equal proportions at the .05 level; the two population proportions differ. The 95 percent confidence interval is .3333 ± 1.96 * sqrt(.6667*(1-.6667)/30 + .3333*(1-.3333)/30).
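The pooled computation above can be reproduced from the success counts alone. The following is a minimal Python sketch using only the standard library; the function name two_sample_prop_z is an illustrative assumption.

```python
import math


def two_sample_prop_z(x1, n1, x2, n2):
    """Two-sample z test of equal proportions with a pooled standard error.

    Hypothetical helper: x1 and x2 are success counts, n1 and n2 sample sizes.
    Returns the pooled proportion, the z score, and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error under the null hypothesis p1 = p2.
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return pooled, z, p_two_sided


# y1: 20 of 30 successes; y2: 10 of 30 successes.
pooled, z, p = two_sample_prop_z(20, 30, 10, 30)
print(f"pooled = {pooled:.4f}, z = {z:.4f}, two-sided p = {p:.4f}")
```

The results should agree with the "under Ho" row of the .prtest output.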

If the data set is arranged in the first type, run the following command.

. prtest y1, by(group)
(output is skipped)

Alternatively, you may use the following formula (Hildebrand et al. 2005: 386-388). Note that its denominator is used to construct the confidence interval of p1-p2.

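$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}}$$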

This formula returns 2.7386 = (2/3-1/3) / sqrt(.6667*(1-.6667)/30 + .3333*(1-.3333)/30), which is slightly larger than 2.5820 above. We can reject the null hypothesis of equal proportions (p=.0062).
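The unpooled statistic and the confidence interval of p1-p2 share the same denominator, so both can be computed together. The following is a minimal Python sketch; the function name unpooled_prop_z is an illustrative assumption, and the interval uses the rounded critical value 1.96.

```python
import math


def unpooled_prop_z(x1, n1, x2, n2, z_crit=1.96):
    """z score with an unpooled standard error, plus the CI of p1 - p2.

    Hypothetical helper: returns (z, two-sided p-value, (lower, upper)).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Each sample contributes its own variance estimate.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = diff / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z, p_two_sided, ci


# y1: 20 of 30 successes; y2: 10 of 30 successes.
z, p, (lower, upper) = unpooled_prop_z(20, 30, 10, 30)
print(f"z = {z:.4f}, two-sided p = {p:.4f}")
print(f"95% CI of the difference: [{lower:.4f}, {upper:.4f}]")
```

The interval should match the "diff" row of the .prtest output up to rounding of the critical value.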

SAS reports the same z score (2.58) and p-value (.0098) as the pooled test in Stata. You need to select Solution--> Analyst--> Statistics--> Hypothesis Tests--> Two-Sample Test for a Proportion.

                          Two Sample Test of Equality of Proportions                                                                 
   Sample Statistics                                                                                                                   
                                - Frequencies of -                                                                                     
      Value                   v1            v2
      ----------------------------------------
      0                       10            20
      1                       20            10
                                                                                                                                                                                                                                   
   Hypothesis Test

      Null hypothesis:  Proportion of v1 - Proportion of v2 =  0
      Alternative:      Proportion of v1 - Proportion of v2 ^= 0

                                - Proportions of -
      Value                   v1            v2      Z     Prob > Z
      ------------------------------------------------------------
      1                   0.6667        0.3333     2.58    0.0098       

If you have aggregated information only, use the .prtesti command followed by the number of observations and the proportion of success for the first sample, and then the same two values for the second sample.

. prtesti 30 .6667 30 .3333
(output is skipped)


7.3 Comparing Means versus Comparing Proportions

Now, you may ask yourself: "What if I conduct the t-test to compare means of two binary variables?" or "What is the advantage of comparing proportions over comparing means (t-test)?" The simple answer is that there is no big difference when the sample size is large. The only difference between comparing means and comparing proportions comes from the computation of the denominators in the formulas. The difference becomes smaller as the sample size increases. If N is sufficiently large, both the t distribution and the binomial distribution are well approximated by the normal distribution.

Let us perform the independent sample t-test on the same data and check the difference. The unpaired option indicates that the two samples are not paired but independent of each other.

. ttest y1=y2, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
      y1 |      30    .6666667    .0875376    .4794633    .4876321    .8457012
      y2 |      30    .3333333    .0875376    .4794633    .1542988    .5123679
---------+--------------------------------------------------------------------
combined |      60          .5    .0650945    .5042195    .3697463    .6302537
---------+--------------------------------------------------------------------
    diff |            .3333333    .1237969                .0855269    .5811397
------------------------------------------------------------------------------
    diff = mean(y1) - mean(y2)                                    t =   2.6926
Ho: diff = 0                                     degrees of freedom =       58
 
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9954         Pr(|T| > |t|) = 0.0093          Pr(T > t) = 0.0046

The t of 2.6926 is similar to the z score of 2.5820. Their p-values are respectively .0093 and .0098; the null hypothesis is rejected in both tests.
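To see where the two statistics diverge, both can be computed from the same success counts. The following is a minimal Python sketch; the function name binary_t_and_z is an illustrative assumption. It relies on the fact that the sample variance of a 0/1 variable is n*p*(1-p)/(n-1), and it does not compute the t p-value because the Python standard library has no t distribution.

```python
import math


def binary_t_and_z(x1, n1, x2, n2):
    """Equal-variance t statistic and pooled z statistic for two 0/1 samples.

    Hypothetical helper: returns (t, degrees of freedom, z).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # t-test denominator: pooled sample variance with n - 1 divisors.
    s1_sq = n1 * p1 * (1 - p1) / (n1 - 1)
    s2_sq = n2 * p2 * (1 - p2) / (n2 - 1)
    df = n1 + n2 - 2
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df
    t = diff / math.sqrt(sp_sq * (1 / n1 + 1 / n2))

    # z-test denominator: variance implied by the pooled proportion.
    pooled = (x1 + x2) / (n1 + n2)
    z = diff / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return t, df, z


# y1: 20 of 30 successes; y2: 10 of 30 successes.
t, df, z = binary_t_and_z(20, 30, 10, 30)
print(f"t = {t:.4f} (df = {df}), z = {z:.4f}")
```

The numerators are identical; only the variance estimates in the denominators differ.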

Table 6 suggests that the difference between comparing means (t-test) and proportions (z-test) becomes negligible as N becomes larger. The random variable a was drawn from RAND('BERNOULLI', .50) in SAS, which is the random number generator for the Bernoulli distribution with a probability of .50. Similarly, the variable b was generated from RAND('BERNOULLI', .55). Roughly speaking, the p-values of t and z become almost the same once the sample size exceeds 30.

Table 6


