Logistic Regression with SAS
LOGISTIC Procedure
Suppose the response variable Y is binary, coded 0 or 1 (the coding is not a limitation: the values can be numeric or character as long as there are exactly two of them), and X1 and X2 are two regressors of interest. To fit a logistic regression, you can use:
proc logistic; model y=x1 x2; run;
By default, PROC LOGISTIC models the probability of Y=0. In other words, SAS models the probability of the smaller response value. One way to model the probability of Y=1 instead is to specify the DESCENDING option in the PROC LOGISTIC statement. That is, use:
proc logistic descending; model y=x1 x2; run;
The following data are from Cox (Cox, D. R., 1970. The Analysis of Binary Data, London, Methuen, p. 86). At each specified heating time (T), a number of ingots are tested at some temperature settings, and whether each ingot is ready for rolling (S) is recorded. S=0 means not ready and S=1 means ready. You want to know whether the heating time affects whether an ingot is ready for rolling.
T S
1 7 1
2 7 1
.
.
55 7 1
1 14 0
2 14 0
3 14 1
4 14 1
.
.
157 14 1
1 27 0
2 27 0
.
.
7 27 0
8 27 1
9 27 1
.
.
159 27 1
1 51 0
2 51 0
3 51 0
4 51 1
.
.
16 51 1
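One possible way to create the data set INGOT is sketched here, assuming the cell counts implied by the listing above (for example, 2 not-ready and 155 ready ingots at T=14). The count variable F and the loop index I are illustrative names, not part of the original data.

```sas
/* Sketch: build the 387-row INGOT data set from per-cell counts.     */
/* F (cell count) and I (loop index) are illustrative helper names.   */
data ingot;
   input t s f;
   do i = 1 to f;      /* write one row per ingot in this (T, S) cell */
      output;
   end;
   drop i f;
   datalines;
7 1 55
14 0 2
14 1 155
27 0 7
27 1 152
51 0 3
51 1 13
;
run;
```

The counts sum to 387, which should match the number of observations PROC LOGISTIC reports.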
With this data set INGOT, you can use:
proc logistic data=ingot; model s=t; run;
As a result, you will have the following SAS output:
Sample Program: Logistic Regression

The LOGISTIC Procedure

Data Set: WORK.INGOT
Response Variable: S
Response Levels: 2
Number of Observations: 387
Link Function: Logit

Response Profile

Ordered
  Value    S    Count
      1    0       12
      2    1      375

Model Fitting Information and Testing Global Null Hypothesis BETA=0

             Intercept    Intercept and
Criterion         Only       Covariates    Chi-Square for Covariates
AIC            108.988           99.375    .
SC             112.947          107.291    .
-2 LOG L       106.988           95.375    11.614 with 1 DF (p=0.0007)
Score                .                .    15.100 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

              Parameter  Standard        Wald        Pr >  Standardized    Odds
Variable  DF   Estimate     Error  Chi-Square  Chi-Square      Estimate   Ratio
INTERCPT   1    -5.4152    0.7275     55.4005      0.0001             .       .
T          1     0.0807    0.0224     13.0290      0.0003      0.442056   1.084

Association of Predicted Probabilities and Observed Responses

Concordant = 59.2%    Somers' D = 0.499
Discordant =  9.4%    Gamma     = 0.727
Tied       = 31.4%    Tau-a     = 0.030
(4500 pairs)          c         = 0.749
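As a check on how the fit criteria in this output relate to one another, the arithmetic can be written out. This sketch assumes the usual definitions AIC = -2 log L + 2k and SC = -2 log L + k log n, with k = 2 estimated parameters and n = 387 observations. Differences in the last digit come from rounding in the printed values.

$$
\begin{aligned}
\chi^2_{LR} &= 106.988 - 95.375 = 11.613 \approx 11.614 \\
\mathrm{AIC} &= 95.375 + 2(2) = 99.375 \\
\mathrm{SC} &= 95.375 + 2\log(387) \approx 95.375 + 11.917 = 107.292 \approx 107.291
\end{aligned}
$$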
The result shows that the estimated logit is
$$\log\left(\frac{p}{1-p}\right) = -5.4152 + 0.0807\,T$$
where p is the probability that an ingot is not ready for rolling. The slope coefficient 0.0807 is the change in the log odds for a one-unit increase in T (time of heating). The odds ratio 1.084 is the ratio of the odds after and before a one-unit increase in T. It is computed by exponentiating the slope coefficient, which gives exp(0.0807)=1.084 in this example.
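To see why exponentiating the slope gives the odds ratio, compare the fitted odds at T+1 and at T, taking the fitted model to be log(p/(1-p)) = -5.4152 + 0.0807 T as reported in the output:

$$
\frac{\mathrm{odds}(T+1)}{\mathrm{odds}(T)}
= \frac{\exp\{-5.4152 + 0.0807\,(T+1)\}}{\exp\{-5.4152 + 0.0807\,T\}}
= \exp(0.0807) \approx 1.084
$$

The ratio does not depend on T, which is why a single odds ratio is reported for the variable.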
If you had used the DESCENDING option:
proc logistic descending; model s=t; run;
it would have yielded the following estimated logit:
$$\log\left(\frac{p}{1-p}\right) = 5.4152 - 0.0807\,T$$
where p is the probability that an ingot is ready for rolling.
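The two fits describe the same model. Writing p_0 for the probability of S=0 and p_1 = 1 - p_0 for the probability of S=1, and taking the default fit to be log(p_0/(1-p_0)) = -5.4152 + 0.0807 T as reported in the output, every coefficient simply changes sign:

$$
\log\left(\frac{p_1}{1-p_1}\right)
= \log\left(\frac{1-p_0}{p_0}\right)
= -\log\left(\frac{p_0}{1-p_0}\right)
= 5.4152 - 0.0807\,T
$$

The odds ratio for T under DESCENDING is therefore the reciprocal, exp(-0.0807), or about 0.922.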
You may have the same data set arranged in the following frequency format:
T S F
7 1 55
14 0 2
14 1 155
27 0 7
27 1 152
51 0 3
51 1 13
In this case, to have the same output as above, you can use the syntax:
proc logistic; freq f; model s=t; run;
The LOGISTIC procedure also allows the input of binary response data that are grouped so that you can use:
proc logistic; model r/n=x1 x2; run;
where N represents the number of trials and R represents the number of events.
The data set described in the previous example can be arranged in a different way. For each specified heating time (T), you can record the number of ingots tested (N) and the number not ready for rolling (R). Now you have:
T R N
7 0 55
14 2 157
27 7 159
51 3 16
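If the ungrouped data set INGOT (one row per ingot, with variables T and S) already exists, the grouped layout can be derived from it instead of being typed in. This is a minimal sketch using PROC SQL, not the only way to do it. It counts S=0 as the event to match the table above.

```sas
/* Sketch: collapse one-row-per-ingot data into events/trials form. */
proc sql;
   create table ingot2 as
   select t,
          sum(s = 0) as r,   /* events: ingots not ready (S=0)  */
          count(*)   as n    /* trials: ingots tested at this T */
   from ingot
   group by t;
quit;
```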
With this data set INGOT2, you can use:
proc logistic data=ingot2; model r/n=t; run;
The SAS output will be:
Sample Program: Logistic Regression

The LOGISTIC Procedure

Data Set: WORK.INGOT2
Response Variable (Events): R
Response Variable (Trials): N
Number of Observations: 4
Link Function: Logit

Response Profile

Ordered    Binary
  Value    Outcome     Count
      1    EVENT          12
      2    NO EVENT      375

Model Fitting Information and Testing Global Null Hypothesis BETA=0

             Intercept    Intercept and
Criterion         Only       Covariates    Chi-Square for Covariates
AIC            108.988           99.375    .
SC             112.947          107.291    .
-2 LOG L       106.988           95.375    11.614 with 1 DF (p=0.0007)
Score                .                .    15.100 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

              Parameter  Standard        Wald        Pr >  Standardized    Odds
Variable  DF   Estimate     Error  Chi-Square  Chi-Square      Estimate   Ratio
INTERCPT   1    -5.4152    0.7275     55.4005      0.0001             .       .
T          1     0.0807    0.0224     13.0290      0.0003      0.442056   1.084

Association of Predicted Probabilities and Observed Responses

Concordant = 59.2%    Somers' D = 0.499
Discordant =  9.4%    Gamma     = 0.727
Tied       = 31.4%    Tau-a     = 0.030
(4500 pairs)          c         = 0.749
Sometimes you may be interested in the change in the log odds, and thus the corresponding odds ratio, for a change other than one unit in the explanatory variable. In this case, you can customize the odds ratio calculation with the UNITS statement:
proc logistic; model y=x1 x2; units x1=list; run;
where list is a list of the changes of interest for the variable X1. Each unit of change in the list has one of the following forms:
number
SD or -SD
number*SD
where number is any nonzero number and SD is the sample standard deviation of the corresponding independent variable X1.
Using the same data set in Example 2, if you use:
proc logistic data=ingot2; model r/n=t; units t=10 -10 sd 2*sd; run;
you will have the following result in addition to the output in Example 2:
Conditional Odds Ratio

                        Odds
Variable       Unit    Ratio
T           10.0000    2.241
T          -10.0000    0.446
T            9.9361    2.230
T           19.8721    4.971
In this example, you calculated four different odds ratios, corresponding to a 10-unit increase, a 10-unit decrease, a 1 standard deviation increase, and a 2 standard deviation increase in T, respectively.
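Each of these values follows from the slope estimate 0.0807 for T: the odds ratio for a change of c units is exp(c times the slope). Using the rounded estimate and the sample standard deviation 9.9361 shown in the Unit column, the arithmetic is approximately:

$$
\begin{aligned}
\exp(10 \times 0.0807) &= \exp(0.807) \approx 2.241 \\
\exp(-10 \times 0.0807) &= \exp(-0.807) \approx 0.446 \\
\exp(9.9361 \times 0.0807) &\approx \exp(0.8018) \approx 2.230 \\
\exp(19.8721 \times 0.0807) &\approx \exp(1.6037) \approx 4.971
\end{aligned}
$$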
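From the PROC LOGISTIC output, you can also obtain predicted probabilities. Suppose you want to know the predicted probability that an ingot is not ready for rolling (Y=0) at each level of heating time in the data set from Example 2. The predicted probability, p, can be computed from the formula:
$$p = \frac{\exp(-5.4152 + 0.0807\,T)}{1 + \exp(-5.4152 + 0.0807\,T)}$$
Thus, for example, at T=7,
$$p = \frac{\exp(-5.4152 + 0.0807 \times 7)}{1 + \exp(-5.4152 + 0.0807 \times 7)} = \frac{\exp(-4.8503)}{1 + \exp(-4.8503)} \approx 0.0078$$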
This computation can be easily obtained as a part of the SAS output by using the OUTPUT statement and PRINT procedure:
proc logistic; model r/n=x1 x2; output out=filename predicted=varname; run; proc print data=filename; run;
where filename is the output data set name and varname is the variable name for predicted probabilities. The SAS output will show all the predicted probabilities for all observation points.
However, if you need predicted probabilities at levels of the explanatory variables that do not appear in the data set, you need to do something different. Create a new SAS data set that contains the levels of interest and missing values for the response variable. Then append the new data to the original data (with a SET statement) and run the logistic regression on the combined data set. Because the new observations have missing values for the response variable, they do not affect the model fit, but predicted probabilities are still calculated for them.
Using the data in Example 2, if you use:
proc logistic data=ingot2; model r/n=t; output out=prob predicted=phat; run; proc print data=prob; run;
you will have the following additional result to the output in Example 2:
Sample Program: Logistic Regression
OBS T R N PHAT
1 7 0 55 0.00777
2 14 2 157 0.01358
3 27 7 159 0.03782
4 51 3 16 0.21422
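The PHAT column can be checked by hand from the fitted coefficients. The following DATA step is a minimal sketch that plugs the rounded estimates -5.4152 and 0.0807 into the inverse logit. The variable names B0, B1, ETA and P are illustrative. Because the estimates are rounded, the results should agree with PHAT only to about four decimal places.

```sas
/* Sketch: reproduce predicted probabilities from rounded estimates. */
data _null_;
   b0 = -5.4152;                      /* intercept from the output   */
   b1 =  0.0807;                      /* slope for T from the output */
   do t = 7, 14, 27, 51;
      eta = b0 + b1*t;                /* linear predictor (logit)    */
      p   = exp(eta) / (1 + exp(eta)); /* inverse logit              */
      put t= p= 8.5;                  /* written to the SAS log      */
   end;
run;
```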
Now suppose you want to compute the predicted probabilities at T=10,20,30,40,50, and 60. You can use the following syntax:
data ingot2; input t r n; cards;
7 0 55
14 2 157
27 7 159
51 3 16
;
data new; input t @@; r=.; n=.; cards;
10 20 30 40 50 60
;
data merged; set ingot2 new; run;
proc logistic data=merged; model r/n=t; output out=prob predicted=phat; run;
proc print data=prob; run;
You will have the following additional output to show the predicted probability at each level of T of interest:
Sample Program: Logistic Regression
OBS T R N PHAT
1 7 0 55 0.00777
2 14 2 157 0.01358
3 27 7 159 0.03782
4 51 3 16 0.21422
5 10 . . 0.00987
6 20 . . 0.02185
7 30 . . 0.04768
8 40 . . 0.10089
9 50 . . 0.20095
10 60 . . 0.36045
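In more recent SAS releases, PROC LOGISTIC also has a SCORE statement that applies the fitted model to a separate data set, so the new levels do not have to be appended to the estimation data. The sketch below assumes your installation supports SCORE (it is not available in the release that produced the output shown here). The data set names NEWT and SCORED are illustrative.

```sas
/* Sketch: score new T values without appending them to INGOT2.   */
/* Assumes a SAS release in which PROC LOGISTIC supports SCORE.   */
data newt;
   input t @@;
   datalines;
10 20 30 40 50 60
;
run;

proc logistic data=ingot2;
   model r/n = t;
   score data=newt out=scored;   /* predicted probabilities go to SCORED */
run;

proc print data=scored;
run;
```

The names of the probability variables in the scored data set are chosen by the procedure, so inspect the PROC PRINT output rather than relying on a particular name.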
PROBIT Procedure
You can also use PROC PROBIT to fit a logistic regression by specifying LOGISTIC as the cumulative distribution type in the MODEL statement. To fit a logistic regression model, use:
proc probit; class y; model y=x1 x2 / d=logistic; run;
or
proc probit; model r/n=x1 x2 / d=logistic; run;
depending on your data set. If a single response variable is given in the MODEL statement, it must be listed in a CLASS statement. Unlike PROC LOGISTIC in the release shown here (which has no CLASS statement; later releases added one), PROC PROBIT can handle categorical variables as regressors, as shown in the following syntax:
proc probit; class x2; model r/n=x1 x2 / d=logistic; run;
where X2 is a categorical regressor.
Using the data in Example 2, you may use:
proc probit data=ingot2; model r/n=t / d=logistic; run;
The resulting SAS output will be:
Sample Program: Logistic Regression

Probit Procedure

Data Set          =WORK.INGOT2
Dependent Variable=R
Dependent Variable=N
Number of Observations=   4
Number of Events      =  12    Number of Trials = 387

Log Likelihood for LOGISTIC -47.68727905

Probit Procedure

Variable  DF    Estimate   Std Err  ChiSquare  Pr>Chi  Label/Value
INTERCPT   1  -5.4151721  0.727541   55.40004  0.0001  Intercept
T          1  0.08069587  0.022356   13.02885  0.0003

Probit Model in Terms of Tolerance Distribution

      MU       SIGMA
67.10594    12.39221

Estimated Covariance Matrix for Tolerance Parameters

                 MU         SIGMA
MU       121.813302     35.655509
SIGMA     35.655509     11.786672
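The tolerance-distribution parameters are a reparameterization of the regression coefficients. Assuming the usual relationship for a location-scale tolerance distribution (location equal to minus the intercept over the slope, scale equal to the reciprocal of the slope), the printed values can be recovered from the estimates above:

$$
\begin{aligned}
\mathrm{MU} &= -\frac{b_0}{b_1} = \frac{5.4151721}{0.08069587} \approx 67.106 \\
\mathrm{SIGMA} &= \frac{1}{b_1} = \frac{1}{0.08069587} \approx 12.392
\end{aligned}
$$

MU is the heating time at which the fitted probability equals 0.5.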
GENMOD Procedure
The GENMOD procedure fits generalized linear models (Nelder and Wedderburn, 1972, "Generalized Linear Models," Journal of the Royal Statistical Society A, 135, pp. 370-384). Logistic regression is a special case of the generalized linear model in which the response probability distribution is binomial and the link function is the logit. To use PROC GENMOD for a logistic regression, you can use:
proc genmod; model y=x1 x2 / dist=binomial link=logit; run;
or
proc genmod; model r/n=x1 x2 / dist=binomial link=logit; run;
Using the data in Example 2, you may use:
proc genmod data=ingot2; model r/n=t / dist=binomial link=logit; run;
You will have the following SAS output:
Sample Program: Logistic Regression

The GENMOD Procedure

Model Information

Description            Value
Data Set               WORK.INGOT2
Distribution           BINOMIAL
Link Function          LOGIT
Dependent Variable     R
Dependent Variable     N
Observations Used      4
Number Of Events       12
Number Of Trials       387

Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance               2      1.0962      0.5481
Scaled Deviance        2      1.0962      0.5481
Pearson Chi-Square     2      0.6749      0.3374
Scaled Pearson X2      2      0.6749      0.3374
Log Likelihood         .    -47.6873           .

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1    -5.4152    0.7275     55.4000   0.0001
T            1     0.0807    0.0224     13.0289   0.0003
SCALE        0     1.0000    0.0000           .        .

NOTE: The scale parameter was held fixed.
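The deviance in the goodness-of-fit table can be turned into an approximate lack-of-fit test by comparing it with a chi-square distribution on its degrees of freedom. The DATA step below is a sketch of that calculation using the PROBCHI function. The variable names are illustrative, and the chi-square approximation is only rough with four grouped observations.

```sas
/* Sketch: approximate p-value for the deviance goodness-of-fit test. */
data _null_;
   deviance = 1.0962;                 /* Deviance from the output     */
   df       = 2;                      /* its degrees of freedom       */
   p = 1 - probchi(deviance, df);     /* upper-tail chi-square prob.  */
   put p= 6.4;                        /* written to the SAS log       */
run;
```

The result should be about 0.58, which gives no evidence of lack of fit. It corresponds to the LIKELIHOOD RATIO line that PROC CATMOD prints for the same model.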
PROC GENMOD is especially convenient when you need to use categorical or class variables as regressors. In this case, you can use:
proc genmod; class x2; model y=x1 x2 / dist=binomial link=logit; run;
where X2 is a categorical regressor.
This example is excerpted from a SAS manual (SAS, 1996, SAS/STAT Software Changes and Enhancements through Release 6.11, pp. 279-284). In an experiment comparing the effects of five different drugs, each drug was tested on a number of different subjects. The outcome of each experiment was the presence or absence of a positive response in a subject. The following data represent the number of responses R in the N subjects for the five different drugs, labeled A through E. The response is measured for different levels of a continuous covariate X for each drug. The drug type and the covariate X are the explanatory variables in this experiment. The number of responses R is modeled as a binomial random variable for each combination of the explanatory variable values, with the binomial number of trials equal to the number of subjects N and the binomial probability equal to the probability of a response. The following DATA step creates the data set DRUG:
data drug; input drug$ x r n; cards;
A .1 1 10
A .23 2 12
A .67 1 9
B .2 3 13
B .3 4 15
B .45 5 16
B .78 5 13
C .04 0 10
C .15 0 11
C .56 1 12
C .7 2 12
D .34 5 10
D .6 5 9
D .7 8 10
E .2 12 20
E .34 15 20
E .56 13 15
E .8 17 20
;
A logistic regression for these data is a generalized linear model with response equal to the binomial proportion R/N. PROC GENMOD can be used as follows:
proc genmod data=drug; class drug; model r/n=x drug / dist=binomial link=logit; run;
You will have the SAS output:
Sample Program: Logistic Regression

The GENMOD Procedure

Model Information

Description            Value
Data Set               WORK.DRUG
Distribution           BINOMIAL
Link Function          LOGIT
Dependent Variable     R
Dependent Variable     N
Observations Used      18
Number Of Events       99
Number Of Trials       237

Class Level Information

Class    Levels    Values
DRUG          5    A B C D E

Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance              12      5.2751      0.4396
Scaled Deviance       12      5.2751      0.4396
Pearson Chi-Square    12      4.5133      0.3761
Scaled Pearson X2     12      4.5133      0.3761
Log Likelihood         .   -114.7732           .

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1     0.2792    0.4196      0.4430   0.5057
X            1     1.9794    0.7660      6.6770   0.0098
DRUG A       1    -2.8955    0.6092     22.5894   0.0001
DRUG B       1    -2.0162    0.4052     24.7628   0.0001
DRUG C       1    -3.7952    0.6655     32.5258   0.0001
DRUG D       1    -0.8548    0.4838      3.1218   0.0773
DRUG E       0     0.0000    0.0000           .        .
SCALE        0     1.0000    0.0000           .        .

NOTE: The scale parameter was held fixed.
In this example, PROC GENMOD automatically generates five dummy variables, one for each level of the class variable DRUG. The same result can therefore be obtained with PROC LOGISTIC by creating the dummy variables yourself:
data drug2; set drug;
if drug='A' then drugdum1=1; else drugdum1=0;
if drug='B' then drugdum2=1; else drugdum2=0;
if drug='C' then drugdum3=1; else drugdum3=0;
if drug='D' then drugdum4=1; else drugdum4=0;
if drug='E' then drugdum5=1; else drugdum5=0;
run;
proc logistic data=drug2; model r/n=x drugdum1 drugdum2 drugdum3 drugdum4 drugdum5; run;
where the DATA step creates a new data set DRUG2 that contains the five dummy variables. Notice that one of the five dummy variables is redundant: as the output shows, PROC LOGISTIC sets the coefficient of DRUGDUM5 to 0.
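In later SAS releases, PROC LOGISTIC accepts a CLASS statement, so hand-made dummy variables are no longer required. The following is a sketch under that assumption (the output in this document comes from a release without it). PARAM=GLM requests the same less-than-full-rank coding that PROC GENMOD uses, with the last level, E, as the reference.

```sas
/* Sketch: categorical regressor via CLASS in a newer PROC LOGISTIC. */
/* Assumes a SAS release whose PROC LOGISTIC has a CLASS statement.  */
proc logistic data=drug;
   class drug / param=glm;   /* GLM coding: last level (E) is reference */
   model r/n = x drug;
run;
```

With this coding, the estimates for DRUG A through DRUG D should correspond to the DRUGDUM1 through DRUGDUM4 coefficients.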
The resulting output will be:
Sample Program: Logistic Regression

The LOGISTIC Procedure

Data Set: WORK.DRUG2
Response Variable (Events): R
Response Variable (Trials): N
Number of Observations: 18
Link Function: Logit

Response Profile

Ordered    Binary
  Value    Outcome     Count
      1    EVENT          99
      2    NO EVENT      138

Model Fitting Information and Testing Global Null Hypothesis BETA=0

             Intercept    Intercept and
Criterion         Only       Covariates    Chi-Square for Covariates
AIC            324.105          241.546    .
SC             327.573          262.355    .
-2 LOG L       322.105          229.546    92.558 with 5 DF (p=0.0001)
Score                .                .    82.029 with 5 DF (p=0.0001)

NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

DRUGDUM5 = 1 * INTERCPT - 1 * DRUGDUM1 - 1 * DRUGDUM2 - 1 * DRUGDUM3 - 1 * DRUGDUM4

Analysis of Maximum Likelihood Estimates

              Parameter  Standard        Wald        Pr >  Standardized    Odds
Variable  DF   Estimate     Error  Chi-Square  Chi-Square      Estimate   Ratio
INTERCPT   1     0.2792    0.4196      0.4430      0.0001             .       .
X          1     1.9794    0.7660      6.6772      0.0098      0.259740   7.238
DRUGDUM1   1    -2.8955    0.6092     22.5895      0.0001     -0.539417   0.055
DRUGDUM2   1    -2.0162    0.4052     24.7628      0.0001     -0.476082   0.133
DRUGDUM3   1    -3.7952    0.6654     32.5336      0.0001     -0.822382   0.022
DRUGDUM4   1    -0.8548    0.4838      3.1218      0.0773     -0.154773   0.425
DRUGDUM5   0          0         .           .           .             .       .

Association of Predicted Probabilities and Observed Responses

Concordant = 82.3%    Somers' D = 0.686
Discordant = 13.7%    Gamma     = 0.714
Tied       =  4.0%    Tau-a     = 0.335
(13662 pairs)         c         = 0.843
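As an illustration of how the dummy-variable estimates combine, consider the predicted probability of a response for drug A at X = 0.1 (the first row of the DRUG data). Only the intercept, the X coefficient, and the drug A coefficient enter, because the other dummy variables are zero. Using the rounded estimates from the table:

$$
\begin{aligned}
\hat\eta &= 0.2792 + 1.9794 \times 0.1 - 2.8955 \approx -2.4184 \\
\hat p &= \frac{\exp(\hat\eta)}{1 + \exp(\hat\eta)} \approx \frac{0.0891}{1.0891} \approx 0.082
\end{aligned}
$$

For drug E, the reference level, the drug term drops out and the linear predictor is 0.2792 + 1.9794 X.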
CATMOD Procedure
The SAS CATMOD (CATegorical data MODeling) procedure fits linear models to functions of response frequencies and can be used for logistic regression. The basic syntax is:
proc catmod; direct x1; response logits; model y=x1 x2; run;
where X1 is a continuous quantitative variable and X2 is a categorical variable. You must specify your continuous regressors in the DIRECT statement. Because the CATMOD procedure is mainly designed for the analysis of categorical data, it is not recommended for use with a continuous regressor with a large number of unique values.
Using the data in Example 1, if you use:
proc catmod data=ingot; direct t; response logits; model s=t; run;
you will see the result:
Sample Program: Logistic Regression

CATMOD PROCEDURE

Response: S                Response Levels (R)=    2
Weight Variable: None      Populations     (S)=    4
Data Set: INGOT            Total Frequency (N)=  387
Frequency Missing: 0       Observations  (Obs)=  387

POPULATION PROFILES

                 Sample
Sample     T       Size
     1     7         55
     2    14        157
     3    27        159
     4    51         16

RESPONSE PROFILES

Response    S
       1    0
       2    1

MAXIMUM-LIKELIHOOD ANALYSIS

                 Sub       -2 Log   Convergence    Parameter Estimates
Iteration  Iteration   Likelihood     Criterion          1          2
        0          0    536.49592        1.0000          0          0
        1          0    152.59147        0.7156    -2.1503     0.0138
        2          0    106.76794        0.3003    -3.5040     0.0361
        3          0    96.711696        0.0942    -4.6746     0.0633
        4          0    95.411914        0.0134    -5.2884     0.0779
        5          0    95.374601      0.000391    -5.4109     0.0806
        6          0    95.374558     4.5308E-7    -5.4152     0.0807
        7          0    95.374558     6.605E-13    -5.4152     0.0807

MAXIMUM-LIKELIHOOD ANALYSIS-OF-VARIANCE TABLE

Source              DF   Chi-Square      Prob
--------------------------------------------------
INTERCEPT            1        55.40    0.0000
T                    1        13.03    0.0003
LIKELIHOOD RATIO     2         1.10    0.5781

ANALYSIS OF MAXIMUM-LIKELIHOOD ESTIMATES

                                   Standard     Chi-
Effect       Parameter  Estimate      Error   Square    Prob
----------------------------------------------------------------
INTERCEPT            1   -5.4152     0.7275    55.40  0.0000
T                    2    0.0807     0.0224    13.03  0.0003



