Education | Seminar in Logistic Regression
Y750 | 5508 | Dr. Joanne Peng


Course Comments

1.  The objectives of the course are (1) to gain an understanding of
some basic features of logistic regression modeling, and (2) to
develop a coherent approach for evaluating the applications of
logistic regression modeling in your field of study.

2.  Instruction will consist of lectures, discussions, computer
exercises, and student presentations.

3.  All students will be expected to complete three projects.  The
projects are assigned by the instructor with the aim of facilitating
students' understanding of various issues surrounding logistic
regression.  The third and final project will be an in-depth
investigation of one or more issues about logistic regression which
the student finds intriguing.  For this third project, each student
will be given an opportunity to orally present his/her design and
findings in class.

4.  The primary prerequisite for this course is the equivalent of a
second course in applied statistics that covered ordinary least
squares regression models.  An aptitude for mathematical analysis and
SAS programming is beneficial.  Deficiencies in particular areas will
be remedied in tutorial sessions.

5.  Required texts for the course are (a) SAS Institute's Logistic
Regression: Examples using the SAS system (SAS Institute, 1995), and
(b) Hosmer and Lemeshow's Applied Logistic Regression (John Wiley and
Sons, 2001).  Suggested references are (i) Kleinbaum's Logistic
Regression (Springer, 1994), and (ii) Menard's Applied Logistic
Regression Analysis (Sage Publication #106, 1992).  Additional
readings are found in a course packet (required) or will be
distributed in class.

Schedule

Lecture/Topic/from text (b)

1  Introduction to logistic modeling/section 1.1

2  Fitting the simplest logistic regression model/Ch. 1

3  Fitting multiple logistic regression model/Ch. 2

4, 5, 6, 7  Interpreting coefficients of logistic regression models/Ch. 3

8 & 9   Model building strategies and methods/Ch. 4

10 & 11 Assessing the fit of the model/Ch. 5

12, 13, 14  Logistic regression for matched case-control studies/Ch. 7

Advanced topic: Polytomous logistic regression/Ch. 8.1
or
Choosing between logistic regression and discriminant analysis

15   Student presentations (12/10, 8:30 to 12:15 pm)/The 3rd (final)
paper is due in class.

Note: In order for you to devote sufficient time to the final project,
it is best that the formal lectures/classes end on November 19th (a
week before the Thanksgiving recess).  Therefore, I'd like to suggest
that we meet twice for the next two weeks, say on Monday or Tuesday in
the evenings.  This arrangement will ensure that we cover essential
materials in time for you to carry out a meaningful final project.

Grading System

The final course grade will be a composite of grades assigned to three
papers which students are expected to complete throughout the
semester.   Equal weights will be applied to the three paper grades in
determining the composite.  Instructions on writing each paper as well
as criteria in determining paper grades will be announced in class.

Incompletes will be given only for a legitimate reason as outlined in
the university's Academic Guide, and only after a conference between
the instructor and the student.  Throughout the course of this
section, you may contest every grade awarded to your papers or the
overall course performance within 48 hours of receiving such a grade.
Once this "statute of limitations" has passed, it is assumed that you
willingly accept the grade(s) assigned without further dispute.

Labor Sharing System

Each participant is encouraged to contribute to the overall learning
of the class by taking up responsibilities for 3-4 tasks throughout
the semester.  These tasks may include (1) solving an assigned problem
from the text, (2) performing computer analysis of data, (3) leading
discussions of assigned articles, (4) surfing the internet for
information or resources on logistic regression, etc.  The sharing of
these responsibilities is on a voluntary basis.

Academic Honesty and Intellectual Integrity

According to p. 72 of the Academic Handbook (June 1992 edition), each
faculty member has "a responsibility to foster the intellectual
honesty as well as the intellectual development of his/her students."
In order to achieve these goals, each student enrolled in this course
is prohibited from engaging in any form of "cheating" or "plagiarism."
Cheating is defined as "dishonesty of any kind with respect to
examination, course assignments, alteration of records, or illegal
possession of examinations"  (p. 72 of the Academic Handbook).  "It is
the responsibility of the student not only to abstain from cheating
but, in addition, to avoid the appearance of cheating and to guard
against making it possible for others to cheat.  Any student who helps
another student to cheat is as guilty of cheating as the student he or
she assists.  The student also should do everything possible to induce
respect for the examining process and for honesty in the performance
of assigned tasks in or out of class." (p. 72 of Academic Handbook).

Plagiarism is defined as "offering the work of someone else as one's
own" (p. 72 of Academic Handbook).  "The language or ideas thus taken
from another may range from isolated formulas, sentences, or
paragraphs to entire articles copied from books, periodicals,
speeches, or the writings of other students.  The offering of
materials assembled or collected by others in the form of projects or
collections without acknowledgment also is considered plagiarism.  Any
student who fails to give credit for ideas or materials taken from
another source is guilty of plagiarism." (p.72 of Academic Handbook).

Evidence of student academic misconduct will result in (a) a lowered
course grade, (b) transfer out of this course, (c) dismissal from
the student's academic unit, or (d) other disciplinary actions in
accordance with the guidelines outlined on p. 73 of the Academic Handbook.

Example #1 of First Take-Home Project in Y750
for XXXXX (SS#)
Deadline: XXX, 2001

PROJECT EXERCISES:

1.   Given the data below, fit a logistic regression model so that a
county's ability to compete for at least one construction grant is
predicted from its share of the state population, population density,
population change (in percents), median household income, its share of
state minorities, and percent of residents living in poverty.

2.   Discuss the overall significance of the model, significance of
each predictor (or covariate), and the interpretation of any design
variable contained in the model.

3.   Can you improve upon this full regression model by removing any
predictor that is not significantly related to a county's record of
receiving at least one construction grant?  If so, what is the new,
more efficient regression model?

4.   Compare the two models obtained from (1) and (3) above in terms
of the overall fit, the significance of predictors, the
interpretability of the model, the predictive power of each model,
etc.

DATA DESCRIPTION:

The data set is composed of two data sources: one is the county-wide
demographic data and the other is the Consolidated Federal Funds
Report (Bureau of the Census, 1993).  Populations under study included
all counties in the 48 contiguous states during the period 1983-1992
(N=3110).  The data were collected to determine whether disadvantaged
communities have the same access to environmental infrastructure as
other communities do.  Disadvantaged
communities include small, rural, low income, and minority counties,
though rural counties will not be distinguished from urban counties
for the purpose of this first take-home project.

Environmental infrastructure under examination is the Construction
Grants Program administered pursuant to the Clean Water Act.  The
environmental justice literature suggests that the county share of
state population, percent population change, population density,
county share of state minorities, and median household income are
important explanatory variables.  These will be used to account for
and help identify those counties which received at least one grant
during the period of 1983-1992 (a total of 9,854 grants were awarded).

CODING SHEET FOR VARIABLES:

Dependent variable:   whether a county received at least one
construction grant during the period mentioned above.
Independent variables:   county share of state population, percent
population change, population density, county share of state
minorities, and median household income.

Variable names    Definitions and codings

GAWARD            Grant received (0=zero grants received,
                  1=at least one grant received)

AMTTOT            Total amount of grants awarded (in dollars)

POP_RAT           County share of state population

DEN92             Population density (population/square mile)

PC_CHAN           Population change (in percents)

INCOME            Median household income (in dollars)

MIN_RA            County share of state minorities

PC_POPPV          Percent of population in poverty (%)

REFERENCE:

Bureau of the Census, U.S. Department of Commerce (1993).
Consolidated Federal Funds Report (CFFR) on ROM, Fiscal Year
1983-1992.  Washington, DC: Bureau of Census, U. S. Department of
Commerce.

Example #2 of First Take-Home Project in Y750
for XXXXX (SS#)
Deadline: XXX, 2001

PROJECT EXERCISES:

1.   Given the data below, fit a logistic regression model so that a
company's overall image is predicted from its promotion of current
products, promotion of medications under development, quality of sales
representatives, sponsorship of scientific/clinical symposia and
educational programs, quality of corporate hospitality and
entertainment, quality of its exhibition presence, and quality of its
promotional items.

2.   Discuss the overall significance of the model, significance of
each predictor (or covariate), and the interpretation of any design
variable contained in the model.

3.   Can you improve upon this full regression model by removing any
predictor that is not significantly related to a company's overall
image?  If so, what is the new, more efficient regression model?

4.   Compare the two models obtained from (1) and (3) above in terms
of the overall fit, the significance of predictors, the
interpretability of the model, the predictive power of each model,
etc.

DATA DESCRIPTION:

A number of past studies have suggested that an organization's image
plays an important role in the organization's financial performance.
In this context, corporate image refers to an established reputation,
symbol, or market presence that contributes to a business's changing
financial power.

Many pharmaceutical companies spend millions of dollars each year
promoting their companies' image and products to physicians in
convention settings.  These conventions are important to
pharmaceutical companies since they provide access to physicians who
are otherwise difficult to communicate with or inaccessible outside of
conventions.  Understanding physicians' perceptions of image and the
factors that influence these perceptions can enable companies to
optimize promotional strategies, and thus improve corporate image and
performance.  The purpose of this study is to identify the antecedents
of pharmaceutical corporate image in a convention setting, and to
estimate the effects of these antecedents on the overall image of a
company.

Sample: 200 high-prescribing psychiatrists attending the 149th Annual
American Psychiatric Association Convention.  These psychiatrists are
identified as doctors who usually prescribe medicine as a part of
their treatment.  They usually have various options from which to
choose in their prescriptions, and they influence companies'
performance through prescription activities that may be influenced by
a company's image.

CODING SHEET FOR VARIABLES:

Dependent Variable:  A company's overall image (0= bad, 1= good)

Independent Variables:

1. promotion of current products  (0= bad, 1= good)
2. promotion of medications under development (0= bad, 1= good)
3. sales representatives   (0= bad, 1= good)
4. sponsorship of scientific/clinical symposia and educational
program   (0= bad, 1= good)
5. corporate hospitality and entertainment (0= bad, 1= good)
6. exhibition presence   (0= bad, 1= good)
7. promotional items     (0= bad, 1= good)

Feedback on the first project paper
Y750, Spring 2001
Instructor: Joanne Peng

1. The letter grades on 7 papers ranged from XXX to XXX.  The grades
were assigned based on three criteria:
a. accuracy of results
b. completeness of answers, and
c. writing style of the paper.

Feedback from the second reader, as I discovered, was very thorough
and insightful.  I wish to publicly thank our second readers in our
midst:

Dr. Jack Schmit, Dr. Deborah Carter, Mika Omori, Shimon Sarraf, and
Harry So (Math/Stat. center consultant).

2. You all demonstrated a more than adequate level of knowledge about LR
modeling and any computer programming necessary to implement this
model in a data set.

3. Due to the data structure (which is beyond your control), some had
more exciting modeling processes/expressions to report than others.
This factor however did not impact your grade in any way or form.

4. Below are some observations of my own which would have improved
your first paper:

* Descriptive analyses of covariates and their relationships with the
outcome variable should have been included.  Only two papers
presented any information which described the data set on all
variables involved.

* The event category of the outcome variable should be correctly and
clearly identified.  Only one paper discussed this.  You cannot assume
that SAS or SPSS knows the category of DV that you tried to model
using LR.

* Coding/recoding/dummy coding of all covariates should be thoroughly
and clearly discussed and presented.

* When dummy coding, such as 1 for yes, and 2 for no, was in the
original data set, you need to make a judgment as to how reasonable
this is for explaining the outcome of the event.  Maybe it is better
to recode such a variable into (1 for yes, and 0 for no.)

* SC and AIC need to be included, especially when you are comparing
models.

* The difference in -2LL between competing models (full vs. simple)
needs to be presented and discussed as evidence to support your choice
of the final model.

* When a chi-square test is already significant, say at the .001 level,
there is no need to compare the magnitude of such a test statistic,
since each is associated with a different df.  The same can be said
about the chi-square test of slope coefficients, unless they are
associated with the same variable or variables of comparable
psychometric scales.

* A bit of introduction (or background information) and review of
relevant literature should help.

* If a pre-designated alpha level is used to determine the
significance of various tests, then such a level should be made known
from the onset.

* When a covariate is forced into a model for practical significance,
you need to tell your readers why this is the case and what you mean
by "practical" significance.

* Interpretations of the output and results should include the model
you finally chose, the meanings of slope coefficients, the predictive
power of such a model, and any evidence to support your choice of this
final model over other competing model(s).
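
The points above about SC, AIC, and the difference in -2LL can be
sketched in a few lines of Python.  This is only an illustration under
stated assumptions: the -2LL values, parameter counts, and sample size
below are hypothetical, the function names are my own, and AIC and SC
follow the usual definitions (-2LL + 2k and -2LL + k ln n, with k
counting the intercept).  The reduced model is assumed to be nested in
the full model, with fewer parameters.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting point: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        q, a = math.exp(-y), 1.0
    else:
        q, a = math.erfc(math.sqrt(y)), 0.5
    # Raise the shape parameter by 1 at a time until it reaches df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def compare_models(neg2ll_full, k_full, neg2ll_reduced, k_reduced, n):
    """Compare nested models; k_* counts parameters including the intercept."""
    g = neg2ll_reduced - neg2ll_full      # difference in -2LL
    df = k_full - k_reduced               # number of parameters removed
    return {
        "G": g,
        "df": df,
        "p_value": chi2_sf(g, df),
        "AIC_full": neg2ll_full + 2 * k_full,
        "AIC_reduced": neg2ll_reduced + 2 * k_reduced,
        "SC_full": neg2ll_full + k_full * math.log(n),
        "SC_reduced": neg2ll_reduced + k_reduced * math.log(n),
    }

# Hypothetical values, not taken from any project data set.
result = compare_models(neg2ll_full=210.0, k_full=7,
                        neg2ll_reduced=213.2, k_reduced=4, n=200)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

A non-significant G, together with smaller AIC and SC for the reduced
model, would count as evidence in favor of the simpler model; the
pre-designated alpha level still has to be stated by you.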

Example #1 of Second Take-Home Project in Y750
for XXXXXXX
Deadline: XXXX, 2001

PROJECT EXERCISES:

1. Given the data below, fit a full logistic regression model so that
an oak tree's sprouting presence, one year following harvesting, is
predicted from (1) the tree species, (2) the diameter of tree measured
(dbh), (3) the age of the tree at harvesting, (4) the site index and
any meaningful two-way interaction among these covariates.   This step
should include properly scaled transformations of continuous
covariates and identification of "meaningful" interaction terms.

2. Assess the fit of the full model in terms of its goodness of fit
statistics and ill-fitted data points.

3. Improve upon the full model by model-building strategies of your
choice which are implemented in either SPSS, SAS, BMDP, or Minitab.

4. Interpret the reduced model derived from (3) above in terms of
odds-ratio associated with subgroups, any covariate acting either as
an effect-modifier or confounder, suggestion for removing certain data
points from data in order to better fit the model, goodness of fit
statistics, and diagnostic statistics, etc.

DATA DESCRIPTION:

The data set contains information collected from 9 timber stands
located in southern Indiana on the Hoosier National Forest.  2184 oak
trees were measured.  Pre-harvest field work began in June of 1987 and
was completed in November 1987.  At that time, species identification,
dbh, and site index were obtained.  Immediately following harvest, age
was obtained.  One year following harvest, sprouting presence was
obtained.  The harvesting was completed over a two year period, so
roughly half the stands were measured one year and the other half
measured the following year; however, the sprouting presence was noted
one year following the harvest.

The data were collected to help researchers model and estimate the
contribution of oak stump sprouts to future stand stocking.

Dependent variable: sprouts

Independent variables: species, dbh, age, and site index.

Variable Names                     definitions and coding

TREE_NO  ID number

SPECIES  Species of each tree
1=white oak
3=black oak
4=scarlet oak
5=red oak
18=chestnut oak

DBH  Diameter of tree measured 4.5 feet above ground level (in cm).

AGE  Age of tree at time of harvest approximately 1 foot above ground
level (years).

SI  Site index, an indication of site quality, the higher the number
the better the site.  It is the height that a tree is expected to
attain in 50 years.

SPROUTS  Presence of sprouts 1 year after harvest.
0=absent
1=sprout or sprouts present

Example #2 of Second Take-Home Project in Y750
for XXXXXXX
Deadline: XXXX, 2001

PROJECT EXERCISES:

1.  Given the data below, fit a logistic regression model so that an
IU undergraduate student's decision to be enrolled at IU is predicted
from the remaining 8 variables, including students' demographics,
ability, and IU contacts prior to admission, and any meaningful
two-way interaction among these covariates.   This step should include
properly scaled transformations of continuous covariates and
identification of "meaningful" interaction terms.

2.  Assess the fit of the full model in terms of its goodness of fit
statistics and ill-fitted data points.

3.  Improve upon the full model by model-building strategies of your
choice which are implemented in either SPSS, SAS, BMDP, or Minitab.

4.  Interpret the reduced model derived from (3) above in terms of
odds-ratio associated with subgroups, any covariate acting either as
an effect-modifier or confounder, suggestion for removing certain data
points from data in order to better fit the model, goodness of fit
statistics, and diagnostic statistics, etc.

Outcome Variable: Enrollments of non-resident IU students who started
their programs in the fall of 1998.

Explanatory Variables:

Demographic Indicators
Ability Score: continuous, T score weighted 50% on both HS rank and
ACT Composite Score
Gender: dichotomous
Ethnicity:  white (referent), black, Hispanic, other
Contact Indicators
Interview: dichotomous, interview with IU alum or admissions counselor
Preview:  dichotomous, received Preview document
HS Visit: dichotomous
SIEmail:  dichotomous, student-initiated email
Mail:  dichotomous, received IU mailing

Feedback on the second project paper
Y750, Spring 2001
Instructor: Joanne Peng

1.  The letter grades on 6 papers ranged from XXX to XXX.  The grades
were assigned based on three criteria:
a.  accuracy of results/interpretations,
b.  completeness of answers, and
c.  writing style of the paper.

Feedback from the second reader, as I discovered, was very thorough
and insightful.  I wish to publicly thank our second readers in our
midst:

Dr. Jack Schmit, Mika Omori, Shimon Sarraf, and Harry So (Math/Stat.
center consultant).

2.  You clearly demonstrated excellent knowledge about LR model
building strategies and computer programming skills necessary to
complete the task on a real-world data set.

3.  Due to data structure (which is beyond your control), some had
more complex steps in model building process to report than others.
This factor however did not impact your grade in any way or form.

4.  Below are some observations of my own which would have improved
your second paper:

* Continuous variables are best left as continuous, unless you have
strong reasons to believe that a scale transformation such as log,
square, quartiles, dichotomy, etc., is needed.

*  If a scale transformation was performed, you need to report the
rationale, or empirical evidence, as to why a certain type of
transformation is needed or desired over other types.

*  P-levels for each stage of model building need to be clearly
stated.  In class some weeks ago, I outlined an action plan for model
building steps.  In that plan, I recommended different p-levels for
different stages in the process.  For example, 10% level is considered
acceptable when each covariate in a multivariate model is evaluated.
Please review various criteria for p-levels when writing up a report
such as your second project paper.

*  You need to clearly state which goodness-of-fit statistics were
examined and what criteria were used to judge/interpret those values.

*  As model building progressed, you undoubtedly were saddled with
many possibilities.  When these possibilities are presented in print,
it will be very helpful to give a code to each model such as (A), (B),
(C), etc, or (I), (II), (III), etc.

*  A table comparing all possible models will be helpful.

*  Inclusion of an ROC curve, i.e., the graph of sensitivity against
(1 - specificity), is desirable.

*  The use of individual observations or covariate patterns in
diagnostic statistics should be made clear.

*  Cut-off points used in evaluating the diagnostic values should be
made clear.

*  Interpretation of the chosen model should include: (a) the
mathematical functional form of the model, (b) the significance of
the + or - sign of each covariate, (c) odds ratios, and (d) certain
profiles of cases, such as sex=male, age=17, SES=low, etc., and their
predicted logits or probabilities.
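
Two of the items above, the predicted logit/probability for a profile
of cases and the ROC curve, can be sketched as follows.  This is a
minimal illustration, not output from any project: the intercept,
coefficients, outcomes, and predicted probabilities are all
hypothetical, the function names are my own, and a case is classified
as an event when its predicted probability is at or above the cutoff.

```python
import math

def predicted_probability(intercept, coefs, profile):
    """Return (logit, probability) for one profile of covariate values."""
    logit = intercept + sum(coefs[name] * value for name, value in profile.items())
    return logit, 1.0 / (1.0 + math.exp(-logit))

def roc_points(y_true, p_hat):
    """(1 - specificity, sensitivity) pairs, one per distinct cutoff."""
    n_event = sum(y_true)
    n_nonevent = len(y_true) - n_event
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(p_hat), reverse=True):
        tp = sum(1 for y, p in zip(y_true, p_hat) if p >= cutoff and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_hat) if p >= cutoff and y == 0)
        points.append((fp / n_nonevent, tp / n_event))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical model and profile (sex=male, age=17, SES=low).
coefs = {"male": 0.5, "age": 0.1, "ses_low": -0.7}
logit, prob = predicted_probability(-1.0, coefs, {"male": 1, "age": 17, "ses_low": 1})
print(f"logit = {logit:.3f}, probability = {prob:.3f}")

# Hypothetical observed outcomes (1 = event) and predicted probabilities.
y_obs = [1, 1, 0, 1, 0, 0]
p_obs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_obs, p_obs)
auc = area_under_curve(curve)
print(f"area under the ROC curve = {auc:.3f}")
```

Plotting the second element of each pair against the first gives the
ROC curve; whether the pairs are built from individual observations or
from covariate patterns should be stated in the paper.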

Possible Topics For The Final Paper

*  Choosing between logistic regression and discriminant analysis
(references included):

a.  Flury's chapter on LR and DA
b.  Efron (1975). The efficiency of logistic regression compared to
normal discriminant analysis, JASA, 70, 892-898.
c.  Press and Wilson. (1978). Choosing between logistic regression and
discriminant analysis.  JASA, 73, 699-705.

*  Comparison of loglinear modeling and logistic regression

*  Detecting differential item functioning (DIF) using logistic
regression procedures

*  The use and evaluation of polytomous logistic regression models
(start from Chapter 8 of Hosmer and Lemeshow).

*  Comparison of logit model and probit model

*  The use of logistic regression reported in two refereed journals of
your field of interest from 1988-98.

Read the guidelines as follows:

Outline for a Review of Papers Using LR in Refereed Journals of Your
Field
[in c:\u\y750\projects\review.outline.doc, 3/19/99]

0.  Introduction: reasons for the review

*  There is an increasing trend of using LR in XXX articles due to
a.  complex data structure
b.  massive amount of data
c.  dichotomous outcomes such as persist vs. non-persist, drop-out vs.
stayer, admit vs. no-admit, private vs. public, etc.

* helpful review articles such as the one by Cabrera (1994, #2 in the
reading packet) and another by Austin and Hinkle (1992, title "LR in
Higher Ed.").

*  increasing use of LR in AERA annual programs, e.g. Division J (?)
on higher ed.

1.  Journals, period (1988-1998), articles reviewed

2.  research questions addressed - link this section to (0.
Introduction) and make the following issues concrete using the
articles reviewed:
a.  complex data structure: mixture of categorical variables and
continuous variables.
b.  massive amount of data
c.  dichotomous outcomes such as persist vs. non-persist, drop-out vs.
stayer, admit vs. no-admit, private vs. public, etc.

3.  types of modeling (mentioned in analysis section?)
forced entry, sequential (not stepwise, more like block), interaction
terms (not hierarchical models), stepwise

4.  issues to be concerned in designs
a.  over-sampling (Ott, #12) - solution, Recommendation (in English)
b.  small sample size for ML estimators, relative to no. of covariates
(Wilson and Hardgrove, 1995 article, #21)
c.  extremely small or large p (baseline) in the data - results in (a)
no diagnostic, (b) needs to overlap between 2 categories, this can be
ok if dealing with large sample size.
d.  over-dispersion problem (i.e., the proportion of event outcomes is
not even across clusters, households, families, or neighborhoods.
Thus, the baseline p varies widely across these clusters.)
Solution: (a) subset analysis, (b) include higher order terms, and (c)
SAS correction formula.

5.  issues to be concerned in analysis (LR) and reporting/presentation
of results:
(1)  which packages used (SAS, SPSS…)
(2)  dummy coding, input formats..
(3)  models used (forced entry, sequential, stepwise, hierarchical)
(4)  goodness of fit statistics (lacking or incorrect)…

Inferential
a.  score test
b.  chi-square (overall)
c.  chi-square (beta's)
d.  chi-square (H/L)
e.  Brown statistics

Descriptive
a.  AIC
b.  SC
c.  R square and rescaled R squared
d.  concordant pairs
e.  discordant pairs
f.  % of correct classification
g.  sensitivity
h.  specificity
i.  false positive
j.  false negative
k.  Somers' D
l.  Gamma
m.  c-statistics
(5)  interaction and confounding between a categorical variable and a
continuous variable.
(6)  Interpretation: [deviance and (Stage) mixed logit with prob.]
a.  significant coefficient (delta p..)
b.  prediction (Stage): logit vs. probability, cutoff points specified
c.  Odds ratio
--Odds ratio = exp(slope coefficient).  As for the meaning of these
odds ratios, take the age variable for example: holding all other
variable values constant, the odds ratio is the multiplicative change
in the odds (or p/(1-p)) of y for each unit change in its
corresponding X.  The relationship between p and the odds is positive
but obviously not linear.

Three conditions must be met before odds ratios can be interpreted:
(a) the explanatory variable does not interact with any other
variable.
(b) the explanatory variable is represented  by a single term in the
model.
(c) a one-unit change in the explanatory variable is meaningful and
relevant.

SAS options can help you obtain the change in odds for a 10- or
100-unit change in X, and also the 95% CI on the odds ratios using
either the likelihood ratio method or the Wald method.  These can be
obtained from SAS using appropriate keywords (refer to my solution
printout for the ICU.SAS exercises from Ch. 2).

d.  Kappa = (p-pc)/(1-pc), where pc = the baseline p
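
The odds ratio and Kappa formulas above can be checked by hand with a
short Python sketch.  Everything numeric here is hypothetical (a slope
of 0.03 per year of age with a standard error of 0.01), the function
names are my own, only the Wald interval is shown (the likelihood
ratio interval needs the fitted likelihood), and p in Kappa is assumed
to be the observed proportion of correct classification.

```python
import math

def odds_ratio(beta, se, units=1.0, z=1.959964):
    """Odds ratio and Wald 95% CI for a `units`-unit change in X."""
    center = units * beta
    half_width = z * abs(units) * se
    return math.exp(center), math.exp(center - half_width), math.exp(center + half_width)

def shift_probability(p, ratio):
    """Apply an odds ratio to a baseline probability p and return the new p."""
    odds = p / (1.0 - p) * ratio
    return odds / (1.0 + odds)

def kappa(p, pc):
    """Kappa = (p - pc) / (1 - pc), with pc the baseline p."""
    return (p - pc) / (1.0 - pc)

# Hypothetical slope for age (per year) and its standard error.
or_1, lo_1, hi_1 = odds_ratio(beta=0.03, se=0.01)
or_10, lo_10, hi_10 = odds_ratio(beta=0.03, se=0.01, units=10)
print(f"1-unit OR  = {or_1:.3f} ({lo_1:.3f}, {hi_1:.3f})")
print(f"10-unit OR = {or_10:.3f} ({lo_10:.3f}, {hi_10:.3f})")

# The same odds ratio moves p by different amounts at different baselines,
# which is why p and the odds are not linearly related.
change_low = shift_probability(0.2, or_10) - 0.2
change_mid = shift_probability(0.5, or_10) - 0.5
print(f"change in p from 0.2: {change_low:.3f}; from 0.5: {change_mid:.3f}")

k = kappa(p=0.8, pc=0.6)
print(f"Kappa = {k:.3f}")
```

Note that the 10-unit odds ratio is the 1-unit odds ratio raised to
the 10th power, not 10 times as large.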

7.  Diagnostic statistics calculated, performed and actions taken.
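
For the descriptive measures listed under item 5 (concordant and
discordant pairs, Somers' D, Gamma, and the c statistic), the
following Python sketch shows how they relate to one another.  It
assumes the common pair-based definitions, compares every event case
with every non-event case on the raw predicted probabilities (no
grouping of probabilities), and uses hypothetical outcomes and
predictions; the function name is my own.

```python
def association_measures(y_true, p_hat):
    """Pair-based association between predicted probabilities and outcomes."""
    events = [p for y, p in zip(y_true, p_hat) if y == 1]
    nonevents = [p for y, p in zip(y_true, p_hat) if y == 0]
    concordant = discordant = tied = 0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1      # event case has the higher prediction
            elif pe < pn:
                discordant += 1      # event case has the lower prediction
            else:
                tied += 1
    total = concordant + discordant + tied
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "somers_d": (concordant - discordant) / total,
        "gamma": (concordant - discordant) / (concordant + discordant),
        "c": (concordant + 0.5 * tied) / total,
    }

# Hypothetical observed outcomes (1 = event) and predicted probabilities,
# including one tied pair.
y_obs = [1, 1, 0, 1, 0, 0]
p_obs = [0.9, 0.7, 0.7, 0.6, 0.4, 0.2]
measures = association_measures(y_obs, p_obs)
for name, value in measures.items():
    print(f"{name}: {value}")
```

Somers' D and Gamma differ only in how tied pairs enter the
denominator, and c equals (Somers' D + 1)/2 under these definitions.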