Logistic Regression with SAS

LOGISTIC Procedure

Suppose the response variable Y is binary, coded 0 or 1. (The coding is not a restriction: the values can be numeric or character as long as there are exactly two of them.) Let X1 and X2 be two regressors of interest. To fit a logistic regression, you can use:

  proc logistic;
  model y=x1 x2;
  run;

By default, PROC LOGISTIC models the probability of Y=0. In other words, it models the probability of the smaller (lower-ordered) response value. To model the probability of Y=1 instead, specify the DESCENDING option in the PROC LOGISTIC statement. That is, use:

  proc logistic descending;

Example 1: SAS Logistic Regression in PROC LOGISTIC (individual data)

The following data are from Cox (Cox, D. R., 1970. The Analysis of Binary Data, London, Methuen, p. 86). For each heating time (T), a number of ingots are tested at some temperature settings, and whether each ingot is ready for rolling (S) is recorded. S=0 means not ready and S=1 means ready. You want to know whether the heating time affects whether an ingot is ready for rolling.

             T      S                          
      1      7      1
      2      7      1
             .
             . 
     55      7      1
      1     14      0
      2     14      0
      3     14      1
      4     14      1
             .
             .
    157     14      1
      1     27      0
      2     27      0
             .
             .
      7     27      0
      8     27      1
      9     27      1
             .
             .
    159     27      1
      1     51      0
      2     51      0
      3     51      0
      4     51      1
             .
             .
     16     51      1

With this data set INGOT, you can use:

  proc logistic data=ingot;
  model s=t;
  run;

As a result, you will have the following SAS output:

                      Sample Program: Logistic Regression                      

                             The LOGISTIC Procedure

     Data Set: WORK.INGOT
     Response Variable: S
     Response Levels: 2
     Number of Observations: 387
     Link Function: Logit

                                Response Profile

                           Ordered
                             Value       S     Count

                                 1       0        12
                                 2       1       375


      Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC             108.988        99.375         .
   SC              112.947       107.291         .
   -2 LOG L        106.988        95.375       11.614 with 1 DF (p=0.0007)
   Score              .             .          15.100 with 1 DF (p=0.0001)


                    Analysis of Maximum Likelihood Estimates

               Parameter Standard    Wald       Pr >    Standardized     Odds
   Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

   INTERCPT 1    -5.4152   0.7275    55.4005     0.0001            .     .
   T        1     0.0807   0.0224    13.0290     0.0003     0.442056    1.084


         Association of Predicted Probabilities and Observed Responses

                   Concordant = 59.2%          Somers' D = 0.499
                   Discordant =  9.4%          Gamma     = 0.727
                   Tied       = 31.4%          Tau-a     = 0.030
                   (4500 pairs)                c         = 0.749

The result shows that the estimated logit is

$$\log\left(\frac{p}{1-p}\right) = -5.4152 + 0.0807\,T$$

where p is the probability that an ingot is not ready for rolling (S=0). The slope coefficient 0.0807 is the change in the log odds for a one-unit increase in T (time of heating). The corresponding odds ratio, 1.084, is the factor by which the odds are multiplied for a one-unit increase in T. It is obtained by exponentiating the coefficient: exp(0.0807)=1.084 in this example.
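As a quick check of this arithmetic, the following Python sketch recomputes the odds ratio from the reported slope. The variable names are illustrative, and the coefficient is copied from the output above rather than re-estimated.

```python
import math

# Slope for T copied from the PROC LOGISTIC output above (log-odds scale).
slope_t = 0.0807

# Exponentiating a log-odds coefficient gives the odds ratio for a
# one-unit increase in the regressor.
odds_ratio = math.exp(slope_t)

# The odds are multiplied by the same factor for every additional unit,
# so a k-unit increase corresponds to odds_ratio ** k = exp(k * slope_t).
odds_ratio_5_units = odds_ratio ** 5

print(f"odds ratio per unit of T:    {odds_ratio:.3f}")
print(f"odds ratio per 5 units of T: {odds_ratio_5_units:.3f}")
```

The second quantity illustrates why a change of several units is not a multiple of the one-unit odds ratio but a power of it.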

If you had used the DESCENDING option:

  proc logistic descending;
  model s=t;
  run;

it would have yielded the following estimated logit:

$$\log\left(\frac{p}{1-p}\right) = 5.4152 - 0.0807\,T$$

where p is the probability of having an ingot ready for rolling.
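The DESCENDING option only changes which response level is treated as the event, so both coefficients change sign and the fitted probabilities become complements of each other. A minimal Python sketch of that relationship, using the coefficients from the output above (the helper name `inv_logit` is an illustrative choice):

```python
import math

def inv_logit(eta):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Coefficients from the default fit, which models S=0 ("not ready").
b0_not_ready, b1_not_ready = -5.4152, 0.0807
# With DESCENDING the modeled event is S=1 ("ready"): both signs flip.
b0_ready, b1_ready = -b0_not_ready, -b1_not_ready

checks = []
for t in (7, 14, 27, 51):
    p_not_ready = inv_logit(b0_not_ready + b1_not_ready * t)
    p_ready = inv_logit(b0_ready + b1_ready * t)
    # The two fits describe complementary events, so the sum is 1.
    checks.append((t, p_not_ready, p_ready))
    print(t, round(p_not_ready, 5), round(p_ready, 5))
```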

You may have the same data set arranged in the following frequency format:

         T     S      F
         7     1     55
        14     0      2
        14     1    155
        27     0      7
        27     1    152
        51     0      3
        51     1     13

In this case, to have the same output as above, you can use the syntax:

  proc logistic;
  freq f;
  model s=t;
  run;

The LOGISTIC procedure also accepts binary response data in grouped (events/trials) form, so that you can use:

  proc logistic;
  model r/n=x1 x2;
  run;

where N represents the number of trials and R represents the number of events.
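To make the events/trials layout concrete, the following Python sketch collapses the frequency-format table shown earlier into one row per value of T. It assumes the event is S=0 (not ready), which is the level PROC LOGISTIC models by default in the output above; the variable names are illustrative.

```python
# Frequency-format rows (T, S, F) copied from the table above.
freq_rows = [
    (7, 1, 55),
    (14, 0, 2), (14, 1, 155),
    (27, 0, 7), (27, 1, 152),
    (51, 0, 3), (51, 1, 13),
]

# Collapse to one row per T with R = number of events, N = number of trials.
# Assumption: the event is S=0 (ingot not ready for rolling).
grouped = {}
for t, s, f in freq_rows:
    r, n = grouped.get(t, (0, 0))
    grouped[t] = (r + (f if s == 0 else 0), n + f)

for t in sorted(grouped):
    r, n = grouped[t]
    print(t, r, n)

total_events = sum(r for r, _ in grouped.values())
total_trials = sum(n for _, n in grouped.values())
```

The totals match the response profile of the earlier output: 12 events out of 387 observations.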

Example 2: SAS Logistic Regression in PROC LOGISTIC (grouped data)

The data set described in the previous example can be arranged in a different way. For each heating time (T), you can record the number of ingots tested (N) and the number not ready for rolling (R). Now you have:

         T     R      N                         
         7     0     55                         
        14     2    157                         
        27     7    159                         
        51     3     16

With this data set INGOT2, you can use:

  proc logistic data=ingot2; 
  model r/n=t;
  run;

The SAS output will be:

                      Sample Program: Logistic Regression 

                             The LOGISTIC Procedure

   Data Set: WORK.INGOT2
   Response Variable (Events): R
   Response Variable (Trials): N
   Number of Observations: 4
   Link Function: Logit
                                Response Profile

                          Ordered  Binary
                            Value  Outcome      Count

                                1  EVENT           12
                                2  NO EVENT       375

      Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC             108.988        99.375         .
   SC              112.947       107.291         .
   -2 LOG L        106.988        95.375       11.614 with 1 DF (p=0.0007)
   Score              .             .          15.100 with 1 DF (p=0.0001)

                    Analysis of Maximum Likelihood Estimates

               Parameter Standard    Wald       Pr >    Standardized     Odds
   Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

   INTERCPT 1    -5.4152   0.7275    55.4005     0.0001            .     .
   T        1     0.0807   0.0224    13.0290     0.0003     0.442056    1.084

         Association of Predicted Probabilities and Observed Responses

                   Concordant = 59.2%          Somers' D = 0.499
                   Discordant =  9.4%          Gamma     = 0.727
                   Tied       = 31.4%          Tau-a     = 0.030
                   (4500 pairs)                c         = 0.749

Sometimes you may be interested in the change in log odds, and thus the corresponding odds ratio, for a change in the explanatory variable other than one unit. In this case, you can request customized odds ratios with the UNITS statement:

  proc logistic;
  model y=x1 x2;
  units x1=list;
  run;

where list represents a list of units of change that are of interest for the variable X1. Each unit of change in the list has one of the following forms:

  number
  SD or -SD
  number*SD

where number is any non-zero number and SD is the sample standard deviation of the corresponding independent variable X1.

Example 3: Customized Odds Computation

Using the same data set in Example 2, if you use:

  proc logistic data=ingot2;
  model r/n=t;
  units t=10 -10 sd 2*sd;
  run;

you will have the following result in addition to the output in Example 2:

                             Conditional Odds Ratio

                                                    Odds
                        Variable        Unit       Ratio

                        T            10.0000       2.241
                        T           -10.0000       0.446
                        T             9.9361       2.230
                        T            19.8721       4.971

In this example, you calculated four different odds ratios, corresponding to a 10-unit increase, a 10-unit decrease, a one standard deviation increase, and a two standard deviation increase in T, respectively.
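The four values can be reproduced by hand as exp(slope × unit). The Python sketch below does this with the slope from the Example 2 output. It assumes the standard deviation is taken over all 387 individual ingots (each T value weighted by its count N), an assumption that reproduces the unit 9.9361 shown in the output.

```python
import math

slope_t = 0.0807  # coefficient of T from the Example 2 output

# Grouped data from Example 2 as (T, N). Assumption: the sample SD is
# computed over individual ingots, so each T is weighted by N.
groups = [(7, 55), (14, 157), (27, 159), (51, 16)]
total = sum(n for _, n in groups)
mean_t = sum(t * n for t, n in groups) / total
sd_t = math.sqrt(sum(n * (t - mean_t) ** 2 for t, n in groups) / (total - 1))

# Units requested by "units t=10 -10 sd 2*sd;"
units = [10.0, -10.0, sd_t, 2 * sd_t]
odds_ratios = [math.exp(slope_t * u) for u in units]

for u, ratio in zip(units, odds_ratios):
    print(f"{u:10.4f} {ratio:8.3f}")
```

Note that the odds ratio for a 10-unit decrease is the reciprocal of the one for a 10-unit increase.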

From the PROC LOGISTIC output, you can also obtain predicted probabilities. Suppose you want to know the predicted probability that an ingot is not ready for rolling (S=0) at each level of heating time in the data set from Example 2. The predicted probability, p, can be computed from the formula:

$$p = \frac{\exp(-5.4152 + 0.0807\,T)}{1 + \exp(-5.4152 + 0.0807\,T)}$$

Thus, for example, at T=7,

$$p = \frac{\exp(-5.4152 + 0.0807 \times 7)}{1 + \exp(-5.4152 + 0.0807 \times 7)} = 0.00777$$

This computation can be easily obtained as a part of the SAS output by using the OUTPUT statement and PRINT procedure:

  proc logistic;
  model r/n=x1 x2;
  output out=filename predicted=varname;
  run;
  proc print data=filename;
  run;

where filename is the output data set name and varname is the variable name for predicted probabilities. The SAS output will show all the predicted probabilities for all observation points.

However, if you need predicted probabilities at levels of the explanatory variables that do not occur in the data set, you need an extra step. Create a new SAS data set that contains the levels of interest and missing values for the response variable, append it to the original data, and run the logistic regression on the combined data set. Observations with a missing response are excluded from the estimation, so they do not affect the model fit, but predicted probabilities are still computed for them.

Example 4: Predicted Probability Computation

Using the data in Example 2, if you use:

  proc logistic data=ingot2;
  model r/n=t;
  output out=prob predicted=phat;
  run;
  proc print data=prob;
  run;

you will have the following result in addition to the output in Example 2:

                      Sample Program: Logistic Regression                      

                        OBS     T    R     N       PHAT

                         1      7    0     55    0.00777
                         2     14    2    157    0.01358
                         3     27    7    159    0.03782
                         4     51    3     16    0.21422

Now suppose you want to compute the predicted probabilities at T=10,20,30,40,50, and 60. You can use the following syntax:

  data ingot2;
  input t r n;
  cards;
   7 0  55
  14 2 157
  27 7 159
  51 3  16
  ;
  data new;
  input t @@;
  r=.;
  n=.;
  cards;
  10 20 30 40 50 60
  ;
  data merged;
  set ingot2 new;
  run;
  proc logistic data=merged;
  model r/n=t;
  output out=prob predicted=phat;
  run;
  proc print data=prob;
  run;

You will have the following additional output to show the predicted probability at each level of T of interest:

                      Sample Program: Logistic Regression                      

                        OBS     T    R     N       PHAT

                          1     7    0     55    0.00777
                          2    14    2    157    0.01358
                          3    27    7    159    0.03782
                          4    51    3     16    0.21422
                          5    10    .      .    0.00987
                          6    20    .      .    0.02185
                          7    30    .      .    0.04768
                          8    40    .      .    0.10089
                          9    50    .      .    0.20095
                         10    60    .      .    0.36045
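The PHAT column can be checked outside SAS by applying the inverse logit to the fitted coefficients. The Python sketch below uses the higher-precision estimates printed in the PROC PROBIT output of Example 5, because the four-decimal values in the PROC LOGISTIC table are too coarse to reproduce every fifth decimal. The row layout and function name are illustrative.

```python
import math

# Estimates copied from the PROC PROBIT output in Example 5, which
# prints more digits than the PROC LOGISTIC table.
INTERCEPT = -5.4151721
SLOPE_T = 0.08069587

def predicted_probability(t):
    """Predicted probability that an ingot is not ready at heating time t."""
    eta = INTERCEPT + SLOPE_T * t
    return math.exp(eta) / (1.0 + math.exp(eta))

# Rows mimic the combined data set: R and N are None for the new T values.
rows = [(7, 0, 55), (14, 2, 157), (27, 7, 159), (51, 3, 16)]
rows += [(t, None, None) for t in (10, 20, 30, 40, 50, 60)]

# Only rows with an observed response would enter the fit ...
fit_rows = [row for row in rows if row[1] is not None]
# ... but every row receives a predicted probability.
phat = {t: predicted_probability(t) for t, _, _ in rows}

for t, r, n in rows:
    print(t, r, n, f"{phat[t]:.5f}")
```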

PROBIT Procedure

You can also use PROC PROBIT to fit a logistic regression by specifying LOGISTIC as the cumulative distribution type in the MODEL statement. To fit a logistic regression model, use:

  proc probit;
  class y;
  model y=x1 x2 / d=logistic;
  run;

or

  proc probit;
  model r/n=x1 x2 / d=logistic;
  run;

depending on your data set. If a single response variable is given in the MODEL statement, it must be listed in a CLASS statement. Unlike PROC LOGISTIC in the SAS release used for these examples, PROC PROBIT can also handle categorical regressors through the CLASS statement, as shown in the following syntax:

  proc probit;
  class x2;
  model r/n=x1 x2 / d=logistic;
  run;

where X2 is a categorical regressor.

Example 5: SAS Logistic Regression in PROC PROBIT

Using the data in Example 2, you may use:

  proc probit data=ingot2;
  model r/n=t / d=logistic;
  run;

The resulting SAS output will be:

                     Sample Program: Logistic Regression                     

                                Probit Procedure

   Data Set          =WORK.INGOT2
   Dependent Variable=R
   Dependent Variable=N
   Number of Observations=   4
   Number of Events      =      12    Number of Trials =      387


   Log Likelihood for LOGISTIC -47.68727905


                                Probit Procedure

          Variable  DF   Estimate  Std Err ChiSquare  Pr>Chi Label/Value

          INTERCPT   1 -5.4151721 0.727541  55.40004  0.0001 Intercept
          T          1 0.08069587 0.022356  13.02885  0.0003

                Probit Model in Terms of Tolerance Distribution

                                     MU         SIGMA
                               67.10594      12.39221


              Estimated Covariance Matrix for Tolerance Parameters

                                            MU             SIGMA

                          MU        121.813302         35.655509
                       SIGMA         35.655509         11.786672
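The MU and SIGMA values in the tolerance-distribution table are a reparameterization of the intercept and slope: the numbers above are consistent with MU = -intercept/slope and SIGMA = 1/slope. A short Python check, with illustrative variable names:

```python
import math

# Estimates copied from the PROC PROBIT output above.
intercept = -5.4151721
slope_t = 0.08069587

# Location and scale of the logistic tolerance distribution.
mu = -intercept / slope_t   # value of T at which the linear predictor is 0
sigma = 1.0 / slope_t

# At T = mu the fitted probability is exactly one half.
p_at_mu = 1.0 / (1.0 + math.exp(-(intercept + slope_t * mu)))

print(f"MU = {mu:.5f}, SIGMA = {sigma:.5f}, p(T=MU) = {p_at_mu:.3f}")
```

MU can therefore be read as the heating time at which an ingot is estimated to be equally likely to be ready or not ready.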

GENMOD Procedure

The GENMOD procedure fits generalized linear models (Nelder and Wedderburn, 1972, "Generalized Linear Models," Journal of the Royal Statistical Society A, 135, pp. 370-384). Logistic regression is a special case of the generalized linear model in which the response probability distribution is binomial and the link function is the logit. To use PROC GENMOD for a logistic regression, you can use:

  proc genmod;
  model y=x1 x2 / dist=binomial link=logit;
  run;

or

  proc genmod;
  model r/n=x1 x2 / dist=binomial link=logit;
  run;

Example 6: SAS Logistic Regression in PROC GENMOD

Using the data in Example 2, you may use:

  proc genmod data=ingot2;
  model r/n=t / dist=binomial link=logit;
  run;

You will have the following SAS output:

                       Sample Program: Logistic Regression

                              The GENMOD Procedure

                               Model Information

                   Description                     Value

                   Data Set                        WORK.INGOT2
                   Distribution                    BINOMIAL
                   Link Function                   LOGIT
                   Dependent Variable              R
                   Dependent Variable              N
                   Observations Used               4
                   Number Of Events                12
                   Number Of Trials                387


                     Criteria For Assessing Goodness Of Fit

              Criterion             DF         Value      Value/DF

              Deviance               2        1.0962        0.5481
              Scaled Deviance        2        1.0962        0.5481
              Pearson Chi-Square     2        0.6749        0.3374
              Scaled Pearson X2      2        0.6749        0.3374
              Log Likelihood         .      -47.6873             .


                        Analysis Of Parameter Estimates

          Parameter    DF    Estimate     Std Err   ChiSquare  Pr>Chi

          INTERCEPT     1     -5.4152      0.7275     55.4000  0.0001
          T             1      0.0807      0.0224     13.0289  0.0003
          SCALE         0      1.0000      0.0000           .       .

NOTE:  The scale parameter was held fixed.
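The log likelihood and deviance in this output can be tied back to the PROC LOGISTIC results. The Python sketch below evaluates the binomial log likelihood at the fitted coefficients, omitting the binomial coefficient (an assumption that reproduces the reported -47.6873), and compares it with a saturated model that fits one probability per value of T. The helper `xlogy` is an illustrative name.

```python
import math

INTERCEPT, SLOPE_T = -5.4151721, 0.08069587  # from the PROC PROBIT output
data = [(7, 0, 55), (14, 2, 157), (27, 7, 159), (51, 3, 16)]  # (T, R, N)

def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

log_lik = 0.0      # fitted logistic model
log_lik_sat = 0.0  # saturated model: observed proportion R/N at each T
for t, r, n in data:
    p = 1.0 / (1.0 + math.exp(-(INTERCEPT + SLOPE_T * t)))
    log_lik += xlogy(r, p) + xlogy(n - r, 1.0 - p)
    log_lik_sat += xlogy(r, r / n) + xlogy(n - r, (n - r) / n)

minus_2_log_l = -2.0 * log_lik                # "-2 LOG L" in PROC LOGISTIC
deviance = 2.0 * (log_lik_sat - log_lik)      # "Deviance" in PROC GENMOD

print(f"log likelihood = {log_lik:.4f}")
print(f"-2 log L       = {minus_2_log_l:.3f}")
print(f"deviance       = {deviance:.4f}")
```

The deviance has 2 degrees of freedom here because there are four distinct values of T and two estimated parameters.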

PROC GENMOD is especially convenient when you need to use categorical or class variables as regressors. In this case, you can use:

  proc genmod;
  class x2;
  model y=x1 x2 / dist=binomial link=logit;
  run;

where X2 is a categorical regressor.

Example 7: SAS Logistic Regression in PROC GENMOD (categorical regressors)

This example is excerpted from a SAS manual (SAS, 1996, SAS/STAT Software Changes and Enhancements through Release 6.11, pp. 279-284). In an experiment comparing the effects of five different drugs, each drug was tested on a number of different subjects. The outcome of each experiment was the presence or absence of a positive response in a subject. The following data represent the number of responses R in the N subjects for the five different drugs, labeled A through E. The response is measured for different levels of a continuous covariate X for each drug. The drug type and the covariate X are the explanatory variables in this experiment. The number of responses R is modeled as a binomial random variable for each combination of the explanatory variable values, with the binomial number of trials equal to the number of subjects N and the binomial probability equal to the probability of a response. The following DATA step creates the data set DRUG:

  data drug;
  input drug$ x r n;
  cards;
  A  .1   1  10
  A  .23  2  12
  A  .67  1   9
  B  .2   3  13
  B  .3   4  15
  B  .45  5  16
  B  .78  5  13
  C  .04  0  10
  C  .15  0  11
  C  .56  1  12
  C  .7   2  12
  D  .34  5  10
  D  .6   5   9
  D  .7   8  10
  E  .2  12  20
  E  .34 15  20
  E  .56 13  15
  E  .8  17  20
  ;

A logistic regression for these data is a generalized linear model with response equal to the binomial proportion R/N. PROC GENMOD can be used as follows:

  proc genmod data=drug;
  class drug;
  model r/n=x drug / dist=binomial link=logit;
  run;

You will have the SAS output:

                 Sample Program: Logistic Regression

                        The GENMOD Procedure

                          Model Information

              Description                     Value

              Data Set                        WORK.DRUG
              Distribution                    BINOMIAL
              Link Function                   LOGIT
              Dependent Variable              R
              Dependent Variable              N
              Observations Used               18
              Number Of Events                99
              Number Of Trials                237


                       Class Level Information

                     Class     Levels  Values

                     DRUG           5  A B C D E


                Criteria For Assessing Goodness Of Fit

         Criterion             DF         Value      Value/DF

         Deviance              12        5.2751        0.4396
         Scaled Deviance       12        5.2751        0.4396
         Pearson Chi-Square    12        4.5133        0.3761
         Scaled Pearson X2     12        4.5133        0.3761
         Log Likelihood         .     -114.7732             .


                   Analysis Of Parameter Estimates

    Parameter       DF    Estimate     Std Err   ChiSquare  Pr>Chi

    INTERCEPT        1      0.2792      0.4196      0.4430  0.5057
    X                1      1.9794      0.7660      6.6770  0.0098
    DRUG       A     1     -2.8955      0.6092     22.5894  0.0001
    DRUG       B     1     -2.0162      0.4052     24.7628  0.0001
    DRUG       C     1     -3.7952      0.6655     32.5258  0.0001
    DRUG       D     1     -0.8548      0.4838      3.1218  0.0773
    DRUG       E     0      0.0000      0.0000           .       .
    SCALE            0      1.0000      0.0000           .       .

NOTE:  The scale parameter was held fixed.

In this example, PROC GENMOD automatically generates five dummy variables, one for each value of the class variable DRUG. The same result can therefore be obtained without PROC GENMOD by creating the dummy variables yourself and using PROC LOGISTIC:

  data drug2;
  set drug;
  /* one 0/1 indicator per level of DRUG */
  if drug='A' then drugdum1=1; else drugdum1=0;
  if drug='B' then drugdum2=1; else drugdum2=0;
  if drug='C' then drugdum3=1; else drugdum3=0;
  if drug='D' then drugdum4=1; else drugdum4=0;
  if drug='E' then drugdum5=1; else drugdum5=0;
  run;
  proc logistic data=drug2;
  model r/n=x drugdum1 drugdum2 drugdum3 drugdum4 drugdum5;
  run;

where the DATA step creates a new data set DRUG2 that contains the five dummy variables. Notice that one of the five dummy variables is redundant: the five dummies always sum to 1, which duplicates the intercept, so PROC LOGISTIC sets the parameter for DRUGDUM5 to 0. Drug E thus serves as the reference level, as in the PROC GENMOD output.

The resulting output will be:

                      Sample Program: Logistic Regression                      

                             The LOGISTIC Procedure

   Data Set: WORK.DRUG2
   Response Variable (Events): R
   Response Variable (Trials): N
   Number of Observations: 18
   Link Function: Logit

                                Response Profile

                          Ordered  Binary
                            Value  Outcome      Count

                                1  EVENT           99
                                2  NO EVENT       138


      Model Fitting Information and Testing Global Null Hypothesis BETA=0

                               Intercept
                 Intercept        and
   Criterion       Only       Covariates    Chi-Square for Covariates

   AIC             324.105       241.546         .
   SC              327.573       262.355         .
   -2 LOG L        322.105       229.546       92.558 with 5 DF (p=0.0001)
   Score              .             .          82.029 with 5 DF (p=0.0001)


NOTE: The following parameters have been set to 0, since the variables are a
      linear combination of other variables as shown.

      DRUGDUM5 = 1 * INTERCPT - 1 * DRUGDUM1 - 1 * DRUGDUM2 - 1 * DRUGDUM3 - 1
                 * DRUGDUM4


                    Analysis of Maximum Likelihood Estimates

               Parameter Standard    Wald       Pr >    Standardized     Odds
   Variable DF  Estimate   Error  Chi-Square Chi-Square   Estimate      Ratio

   INTERCPT 1     0.2792   0.4196     0.4430     0.5057            .     .
   X        1     1.9794   0.7660     6.6772     0.0098     0.259740    7.238
   DRUGDUM1 1    -2.8955   0.6092    22.5895     0.0001    -0.539417    0.055
   DRUGDUM2 1    -2.0162   0.4052    24.7628     0.0001    -0.476082    0.133
   DRUGDUM3 1    -3.7952   0.6654    32.5336     0.0001    -0.822382    0.022
   DRUGDUM4 1    -0.8548   0.4838     3.1218     0.0773    -0.154773    0.425
   DRUGDUM5 0          0        .      .          .                .     .


         Association of Predicted Probabilities and Observed Responses

                   Concordant = 82.3%          Somers' D = 0.686
                   Discordant = 13.7%          Gamma     = 0.714
                   Tied       =  4.0%          Tau-a     = 0.335
                   (13662 pairs)               c         = 0.843
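The linear dependence reported in the NOTE, and the odds ratios in the estimates table, can both be checked with a few lines of Python. In this sketch the drug labels are rebuilt from the row counts in the DRUG data set and the coefficients are copied from the output; the variable names are illustrative.

```python
import math

levels = ["A", "B", "C", "D", "E"]
# Drug label of each of the 18 rows in the DRUG data set, in order.
drugs = ["A"] * 3 + ["B"] * 4 + ["C"] * 4 + ["D"] * 3 + ["E"] * 4

# One 0/1 indicator per level, mirroring DRUGDUM1 through DRUGDUM5.
dummies = [[1 if d == level else 0 for level in levels] for d in drugs]

# Every row has exactly one indicator set, so the five columns add up to
# the intercept column of ones: one of them carries no extra information.
row_sums = [sum(row) for row in dummies]

# Coefficients of DRUGDUM1..DRUGDUM4 from the output; E is the reference.
coef = {"A": -2.8955, "B": -2.0162, "C": -3.7952, "D": -0.8548}
odds_ratio_vs_e = {drug: math.exp(b) for drug, b in coef.items()}

for drug, ratio in odds_ratio_vs_e.items():
    print(f"odds of response, drug {drug} vs drug E: {ratio:.3f}")
```

Each odds ratio compares a drug with drug E at the same value of X, which is why all four are below 1 for these data.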

CATMOD Procedure

The SAS CATMOD (CATegorical data MODeling) procedure fits linear models to functions of response frequencies and can be used for logistic regression. The basic syntax is:

  proc catmod;
  direct x1;
  response logits;
  model y=x1 x2;
  run;

where X1 is a continuous quantitative variable and X2 is a categorical variable. You must specify your continuous regressors in the DIRECT statement. Because the CATMOD procedure is mainly designed for the analysis of categorical data, it is not recommended for use with a continuous regressor with a large number of unique values.

Example 8: SAS Logistic Regression in PROC CATMOD

Using the data in Example 1, if you use:

  proc catmod data=ingot;
  direct t;
  response logits;
  model s=t;
  run;

you will see the result:

                      Sample Program: Logistic Regression                     

                                CATMOD PROCEDURE

        Response: S                           Response Levels (R)=     2
        Weight Variable: None                 Populations     (S)=     4
        Data Set: INGOT                       Total Frequency (N)=   387
        Frequency Missing: 0                  Observations  (Obs)=   387


                               POPULATION PROFILES
                                            Sample
                              Sample  T      Size 
                                  1    7        55
                                  2   14       157
                                  3   27       159
                                  4   51        16


                               RESPONSE PROFILES

                                  Response  S
                                       1    0
                                       2    1


                           MAXIMUM-LIKELIHOOD ANALYSIS

                   Sub        -2 Log     Convergence    Parameter Estimates
    Iteration   Iteration   Likelihood    Criterion         1           2   
         0           0       536.49592       1.0000            0           0
         1           0       152.59147       0.7156      -2.1503      0.0138
         2           0       106.76794       0.3003      -3.5040      0.0361
         3           0       96.711696       0.0942      -4.6746      0.0633
         4           0       95.411914       0.0134      -5.2884      0.0779
         5           0       95.374601     0.000391      -5.4109      0.0806
         6           0       95.374558    4.5308E-7      -5.4152      0.0807
         7           0       95.374558    6.605E-13      -5.4152      0.0807


                 MAXIMUM-LIKELIHOOD ANALYSIS-OF-VARIANCE TABLE

               Source                   DF   Chi-Square      Prob
               --------------------------------------------------
               INTERCEPT                 1        55.40    0.0000
               T                         1        13.03    0.0003

               LIKELIHOOD RATIO          2         1.10    0.5781



                    ANALYSIS OF MAXIMUM-LIKELIHOOD ESTIMATES

                                               Standard    Chi-
        Effect            Parameter  Estimate    Error    Square   Prob
        ----------------------------------------------------------------
        INTERCEPT                 1   -5.4152    0.7275    55.40  0.0000
        T                         2    0.0807    0.0224    13.03  0.0003
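The iteration history in this output is consistent with Newton-Raphson maximization of the binomial log likelihood, started from zero for both parameters. The Python sketch below applies that method to the grouped ingot data, with not ready (S=0) as the event. It is an illustration under those assumptions, not a description of CATMOD's internal implementation, and the function names are hypothetical.

```python
import math

data = [(7, 0, 55), (14, 2, 157), (27, 7, 159), (51, 3, 16)]  # (T, R, N)

def neg2_log_lik(b0, b1):
    """-2 times the binomial log likelihood (binomial coefficient omitted)."""
    total = 0.0
    for t, r, n in data:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * t)))
        total += r * math.log(p) + (n - r) * math.log(1.0 - p)
    return -2.0 * total

def newton_step(b0, b1):
    """One Newton-Raphson update of (intercept, slope)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for t, r, n in data:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * t)))
        resid = r - n * p          # contribution to the score vector
        w = n * p * (1.0 - p)      # contribution to the information matrix
        g0 += resid
        g1 += resid * t
        h00 += w
        h01 += w * t
        h11 += w * t * t
    # Solve the 2x2 system (information) * delta = score by Cramer's rule.
    det = h00 * h11 - h01 * h01
    d0 = (g0 * h11 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return b0 + d0, b1 + d1

b0, b1 = 0.0, 0.0
history = [(b0, b1, neg2_log_lik(b0, b1))]
for _ in range(7):
    b0, b1 = newton_step(b0, b1)
    history.append((b0, b1, neg2_log_lik(b0, b1)))

for i, (a, b, dev) in enumerate(history):
    print(f"{i:2d} {dev:12.5f} {a:9.4f} {b:8.4f}")
```

The starting value of -2 log likelihood is 2 × 387 × log(2), since every ingot has fitted probability 0.5 when both parameters are zero.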
